Computation and Language 98
☆ Understanding the Limits of Lifelong Knowledge Editing in LLMs
Keeping large language models factually up-to-date is crucial for deployment,
yet costly retraining remains a challenge. Knowledge editing offers a promising
alternative, but existing methods have been tested only on small-scale or
synthetic edit benchmarks. In this work, we aim to bridge research on lifelong
knowledge editing with real-world edits at practically relevant scale. We first
introduce WikiBigEdit: a large-scale benchmark of real-world Wikidata edits,
built to extend automatically over time for future-proof benchmarking. In its
first instance, it includes over 500K question-answer pairs for knowledge
editing alongside a comprehensive evaluation pipeline. We then use WikiBigEdit
to study how well existing knowledge editing techniques incorporate large
volumes of real-world facts, and contrast their capabilities with generic
modification techniques such as retrieval augmentation and continual
finetuning, to obtain a complete picture of the practical extent of current
lifelong knowledge editing.
comment: Preprint
☆ Symbolic Mixture-of-Experts: Adaptive Skill-based Routing for Heterogeneous Reasoning
Combining existing pre-trained expert LLMs is a promising avenue for scalably
tackling large-scale and diverse tasks. However, selecting experts at the task
level is often too coarse-grained, as heterogeneous tasks may require different
expertise for each instance. To enable adaptive instance-level mixing of
pre-trained LLM experts, we propose Symbolic-MoE, a symbolic, text-based, and
gradient-free Mixture-of-Experts framework. Symbolic-MoE takes a fine-grained
approach to selection by emphasizing skills, e.g., algebra in math or molecular
biology in biomedical reasoning. We propose a skill-based recruiting strategy
that dynamically selects the most relevant set of expert LLMs for diverse
reasoning tasks based on their strengths. Each selected expert then generates
its own reasoning, resulting in k outputs from k experts, which are then
synthesized into a final high-quality response by an aggregator chosen based on
its ability to integrate diverse reasoning outputs. We show that Symbolic-MoE's
instance-level expert selection improves performance by a large margin but --
when implemented naively -- can introduce a high computational overhead due to
the need for constant model loading and offloading. To address this, we
implement a batch inference strategy that groups instances based on their
assigned experts, loading each model only once. This allows us to integrate 16
expert models on 1 GPU with a time cost comparable to or better than prior
multi-agent baselines using 4 GPUs. Through extensive evaluations on diverse
benchmarks (MMLU-Pro, GPQA, AIME, and MedMCQA), we demonstrate that
Symbolic-MoE outperforms strong LLMs like GPT-4o-mini, as well as multi-agent
approaches, with an absolute average improvement of 8.15% over the best
multi-agent baseline. Moreover, Symbolic-MoE removes the need for expensive
multi-round discussions, outperforming discussion baselines with less
computation.
comment: The first three authors contributed equally. Project Page:
https://symbolic_moe.github.io/
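As a rough illustration of the batch inference strategy described above (grouping instances by their assigned experts so each model is loaded only once), here is a minimal Python sketch; the `load_model` interface and all names are hypothetical, not the authors' implementation.

```python
from collections import defaultdict

def run_grouped_inference(instances, assignments, load_model):
    """Group instances by their assigned expert so each model is loaded
    exactly once, instead of swapping models in and out per instance.

    instances:   list of task inputs
    assignments: parallel list of expert names chosen per instance
    load_model:  callable that loads an expert by name and returns a
                 callable model (hypothetical interface)
    """
    groups = defaultdict(list)
    for idx, expert in enumerate(assignments):
        groups[expert].append(idx)

    outputs = [None] * len(instances)
    for expert, idxs in groups.items():
        model = load_model(expert)          # one load per expert
        for i in idxs:
            outputs[i] = model(instances[i])
    return outputs                          # results in original order
```

The same idea generalizes to any setting where model-switching cost dominates: sort the work by model, not by arrival order.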
☆ Learning LLM Preference over Intra-Dialogue Pairs: A Framework for Utterance-level Understandings
Xuanqing Liu, Luyang Kong, Wei Niu, Afshin Khashei, Belinda Zeng, Steve Johnson, Jon Jay, Davor Golac, Matt Pope
Large language models (LLMs) have demonstrated remarkable capabilities in
handling complex dialogue tasks without requiring use case-specific
fine-tuning. However, analyzing live dialogues in real-time necessitates
low-latency processing systems, making it impractical to deploy models with
billions of parameters due to latency constraints. As a result, practitioners
often prefer smaller models with millions of parameters, trained on
high-quality, human-annotated datasets. Yet, curating such datasets is both
time-consuming and costly. Consequently, there is a growing need to combine the
scalability of LLM-generated labels with the precision of human annotations,
enabling fine-tuned smaller models to achieve higher speed with accuracy
comparable to that of larger models. In this paper, we introduce a simple yet effective
framework to address this challenge. Our approach is specifically designed for
per-utterance classification problems, which encompass tasks such as intent
detection, dialogue state tracking, and more. To mitigate the impact of
labeling errors from LLMs -- the primary source of inaccuracies in student
models -- we propose a noise-reduced preference learning loss. Experimental
results demonstrate that our method significantly improves accuracy across
utterance-level dialogue tasks, including sentiment detection (over $2\%$),
dialogue act classification (over $1.5\%$), etc.
comment: 7 pages, 4 figures
☆ A Survey on Sparse Autoencoders: Interpreting the Internal Mechanisms of Large Language Models
Large Language Models (LLMs) have revolutionized natural language processing,
yet their internal mechanisms remain largely opaque. Recently, mechanistic
interpretability has attracted significant attention from the research
community as a means to understand the inner workings of LLMs. Among various
mechanistic interpretability approaches, Sparse Autoencoders (SAEs) have
emerged as a particularly promising method due to their ability to disentangle
the complex, superimposed features within LLMs into more interpretable
components. This paper presents a comprehensive examination of SAEs as a
promising approach to interpreting and understanding LLMs. We provide a
systematic overview of SAE principles, architectures, and applications
specifically tailored for LLM analysis, covering theoretical foundations,
implementation strategies, and recent developments in sparsity mechanisms. We
also explore how SAEs can be leveraged to explain the internal workings of
LLMs, steer model behaviors in desired directions, and develop more transparent
training methodologies for future models. Despite the challenges that remain
around SAE implementation and scaling, they continue to provide valuable tools
for understanding the internal mechanisms of large language models.
comment: 20 pages, 3 figures
☆ R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning
Huatong Song, Jinhao Jiang, Yingqian Min, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, Ji-Rong Wen
Existing Large Reasoning Models (LRMs) have shown the potential of
reinforcement learning (RL) to enhance the complex reasoning capabilities of
Large Language Models~(LLMs). While they achieve remarkable performance on
challenging tasks such as mathematics and coding, they often rely on their
internal knowledge to solve problems, which can be inadequate for
time-sensitive or knowledge-intensive questions, leading to inaccuracies and
hallucinations. To address this, we propose \textbf{R1-Searcher}, a novel
two-stage outcome-based RL approach designed to enhance the search capabilities
of LLMs. This method allows LLMs to autonomously invoke external search systems
to access additional knowledge during the reasoning process. Our framework
relies exclusively on RL, without requiring process rewards or distillation for
a cold start, effectively generalizing to out-of-domain datasets and
supporting both Base and Instruct models. Our experiments demonstrate that our
method significantly outperforms previous strong RAG methods, even when
compared to the closed-source GPT-4o-mini.
☆ Quantifying the Robustness of Retrieval-Augmented Language Models Against Spurious Features in Grounding Data
Shiping Yang, Jie Wu, Wenbiao Ding, Ning Wu, Shining Liang, Ming Gong, Hengyuan Zhang, Dongmei Zhang
Robustness has become a critical attribute for the deployment of RAG systems
in real-world applications. Existing research focuses on robustness to explicit
noise (e.g., document semantics) but overlooks spurious features (a.k.a.
implicit noise). While previous works have explored spurious features in LLMs,
they are limited to specific features (e.g., formats) and narrow scenarios
(e.g., ICL). In this work, we statistically confirm the presence of spurious
features in the RAG paradigm, a robustness problem caused by the sensitivity of
LLMs to semantic-agnostic features. Moreover, we provide a comprehensive
taxonomy of spurious features and empirically quantify their impact through
controlled experiments. Further analysis reveals that not all spurious features
are harmful and they can even be beneficial sometimes. Extensive evaluation
results across multiple LLMs suggest that spurious features are a widespread
and challenging problem in the field of RAG. To facilitate future research, we
release all code and data at
https://github.com/maybenotime/RAG-SpuriousFeatures.
☆ Evaluating open-source Large Language Models for automated fact-checking
The increasing prevalence of online misinformation has heightened the demand
for automated fact-checking solutions. Large Language Models (LLMs) have
emerged as potential tools for assisting in this task, but their effectiveness
remains uncertain. This study evaluates the fact-checking capabilities of
various open-source LLMs, focusing on their ability to assess claims with
different levels of contextual information. We conduct three key experiments:
(1) evaluating whether LLMs can identify the semantic relationship between a
claim and a fact-checking article, (2) assessing models' accuracy in verifying
claims when given a related fact-checking article, and (3) testing LLMs'
fact-checking abilities when leveraging data from external knowledge sources
such as Google and Wikipedia. Our results indicate that LLMs perform well in
identifying claim-article connections and verifying fact-checked stories but
struggle with confirming factual news, where they are outperformed by
traditional fine-tuned models such as RoBERTa. Additionally, the introduction
of external knowledge does not significantly enhance LLMs' performance, calling
for more tailored approaches. Our findings highlight both the potential and
limitations of LLMs in automated fact-checking, emphasizing the need for
further refinements before they can reliably replace human fact-checkers.
comment: Main: 10 pages, 13 figures. Supplementary Materials: 7 pages, 29
figures, 1 table. This work has been submitted to the IEEE for possible
publication.
☆ Pi-GPS: Enhancing Geometry Problem Solving by Unleashing the Power of Diagrammatic Information
Geometry problem solving has garnered increasing attention due to its
potential applications in the field of intelligent education. Inspired by the
observation that text often introduces ambiguities that diagrams can clarify,
this paper presents Pi-GPS, a novel framework that unleashes the power of
diagrammatic information to resolve textual ambiguities, an aspect largely
overlooked in prior research. Specifically, we design a micro module comprising
a rectifier and verifier: the rectifier employs MLLMs to disambiguate text
based on the diagrammatic context, while the verifier ensures that the
rectified output adheres to geometric rules, mitigating model hallucinations.
Additionally, we explore the impact of LLMs as the theorem predictor, based on the
disambiguated formal language. Empirical results demonstrate that Pi-GPS
surpasses state-of-the-art models, achieving a nearly 10\% improvement on
Geometry3K over prior neural-symbolic approaches. We hope this work highlights
the significance of resolving textual ambiguity in multimodal mathematical
reasoning, a crucial factor limiting performance.
☆ Cognitive Bias Detection Using Advanced Prompt Engineering
Cognitive biases, systematic deviations from rationality in judgment, pose
significant challenges in generating objective content. This paper introduces a
novel approach for real-time cognitive bias detection in user-generated text
using large language models (LLMs) and advanced prompt engineering techniques.
The proposed system analyzes textual data to identify common cognitive biases
such as confirmation bias, circular reasoning, and hidden assumptions. By
designing tailored prompts, the system effectively leverages LLMs' capabilities
to both recognize and mitigate these biases, improving the quality of
human-generated content (e.g., news, media, reports). Experimental results
demonstrate the high accuracy of our approach in identifying cognitive biases,
offering a valuable tool for enhancing content objectivity and reducing the
risks of biased decision-making.
comment: 17 pages. 6 Figures, 2 Tables
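The tailored-prompt idea above can be sketched as a small prompt builder; the bias definitions and wording here are illustrative assumptions, not the paper's actual templates.

```python
# Illustrative bias inventory -- the paper's actual criteria may differ.
BIASES = {
    "confirmation bias": "favoring evidence that supports a prior belief",
    "circular reasoning": "using the conclusion as its own justification",
    "hidden assumption": "relying on an unstated premise",
}

def build_detection_prompt(text):
    """Assemble a tailored prompt asking an LLM to label cognitive
    biases in user-generated text, citing supporting evidence."""
    criteria = "\n".join(f"- {name}: {desc}" for name, desc in BIASES.items())
    return (
        "You are a neutral reviewer. Identify any of the following "
        "cognitive biases in the text, citing the exact sentence:\n"
        f"{criteria}\n\nText:\n{text}\n\n"
        "Answer with a JSON list of {bias, evidence} objects."
    )
```

Enumerating each bias with a one-line definition, and forcing a structured answer with cited evidence, is what makes the detection auditable rather than a bare yes/no judgment.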
☆ Statistical Guarantees of Correctness Coverage for Medical Multiple-Choice Question Answering
Large language models (LLMs) are increasingly deployed in real-world
question-answering (QA) applications. However, LLMs have been proven to
generate hallucinations and nonfactual information, undermining their
trustworthiness in high-stakes medical tasks. Conformal prediction (CP) is
well-known to be model-agnostic and distribution-free, which creates
statistically rigorous prediction sets in classification tasks. In this work,
we for the first time adapt the CP framework to medical multiple-choice
question-answering (MCQA) tasks, by correlating the nonconformity score with
the frequency score of correct options grounded in self-consistency theory,
assuming no access to internal model information. Considering that the adapted
CP framework can only control the (mis)coverage rate, we employ a risk control
framework, which can manage task-specific metrics by devising a monotonically
decreasing loss function. We evaluate our framework on 3 popular medical MCQA
datasets utilizing 4 "off-the-shelf" LLMs. Empirical results demonstrate that
we achieve user-specified average (or marginal) error rates on the test set.
Furthermore, we observe that the average prediction set size (APSS) on the test
set decreases as the risk level increases, suggesting APSS as a promising
metric for evaluating the uncertainty of LLMs.
comment: Under Review
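A minimal sketch of the adapted split-conformal procedure described above, assuming each question is answered by repeated sampling (self-consistency) and that an option's nonconformity score is one minus its sampling frequency; all function names are illustrative.

```python
import math

def frequency_score(samples, option):
    """Self-consistency frequency of an option across repeated model
    samples; its complement serves as the nonconformity score."""
    return samples.count(option) / len(samples)

def conformal_threshold(cal_nonconformity, alpha):
    """Split-conformal quantile: the ceil((n+1)(1-alpha))-th smallest
    calibration score, giving (1 - alpha) marginal coverage."""
    n = len(cal_nonconformity)
    k = min(math.ceil((n + 1) * (1 - alpha)), n)
    return sorted(cal_nonconformity)[k - 1]

def prediction_set(samples, options, threshold):
    """Keep every option whose nonconformity (1 - frequency) falls
    within the calibrated threshold."""
    return {o for o in options
            if 1.0 - frequency_score(samples, o) <= threshold}
```

On calibration data, the scores are computed for the known-correct options; at test time, the resulting threshold controls which options enter the prediction set.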
☆ EuroBERT: Scaling Multilingual Encoders for European Languages
Nicolas Boizard, Hippolyte Gisserot-Boukhlef, Duarte M. Alves, André Martins, Ayoub Hammal, Caio Corro, Céline Hudelot, Emmanuel Malherbe, Etienne Malaboeuf, Fanny Jourdan, Gabriel Hautreux, João Alves, Kevin El-Haddad, Manuel Faysse, Maxime Peyrard, Nuno M. Guerreiro, Patrick Fernandes, Ricardo Rei, Pierre Colombo
General-purpose multilingual vector representations, used in retrieval,
regression and classification, are traditionally obtained from bidirectional
encoder models. Despite their wide applicability, encoders have been recently
overshadowed by advances in generative decoder-only models. However, many
innovations driving this progress are not inherently tied to decoders. In this
paper, we revisit the development of multilingual encoders through the lens of
these advances, and introduce EuroBERT, a family of multilingual encoders
covering European and widely spoken global languages. Our models outperform
existing alternatives across a diverse range of tasks, spanning multilingual
capabilities, mathematics, and coding, while natively supporting sequences of
up to 8,192 tokens. We also examine the design decisions behind EuroBERT, offering
insights into our dataset composition and training pipeline. We publicly
release the EuroBERT models, including intermediate training checkpoints,
together with our training framework.
comment: 26 pages, 6 figures, 11 tables
☆ Benchmarking LLMs in Recommendation Tasks: A Comparative Evaluation with Conventional Recommenders
In recent years, integrating large language models (LLMs) into recommender
systems has created new opportunities for improving recommendation quality.
However, a comprehensive benchmark is needed to thoroughly evaluate and compare
the recommendation capabilities of LLMs with traditional recommender systems.
In this paper, we introduce RecBench, which systematically investigates various
item representation forms (including unique identifier, text, semantic
embedding, and semantic identifier) and evaluates two primary recommendation
tasks, i.e., click-through rate prediction (CTR) and sequential recommendation
(SeqRec). Our extensive experiments cover up to 17 large models and are
conducted across five diverse datasets from fashion, news, video, books, and
music domains. Our findings indicate that LLM-based recommenders outperform
conventional recommenders, achieving up to a 5% AUC improvement in the CTR
scenario and up to a 170% NDCG@10 improvement in the SeqRec scenario. However,
these substantial performance gains come at the expense of significantly
reduced inference efficiency, rendering the LLM-as-RS paradigm impractical for
real-time recommendation environments. We aim for our findings to inspire
future research, including recommendation-specific model acceleration methods.
We will release our code, data, configurations, and platform to enable other
researchers to reproduce and build upon our experimental results.
☆ KIEval: Evaluation Metric for Document Key Information Extraction
Document Key Information Extraction (KIE) is a technology that transforms
valuable information in document images into structured data, and it has become
an essential function in industrial settings. However, current evaluation
metrics of this technology do not accurately reflect the critical attributes of
its industrial applications. In this paper, we present KIEval, a novel
application-centric evaluation metric for Document KIE models. Unlike prior
metrics, KIEval assesses Document KIE models not only on the extraction of
individual pieces of information (entities) but also on the extraction of
structured information (groupings). Evaluating structured information yields
an assessment that better reflects how grouped information is extracted from
documents in industrial settings. Designed with industrial application in
mind, we believe that KIEval can become a standard evaluation metric for
developing or applying Document KIE models in practice. The code will be
publicly available.
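The distinction between entity-level and grouping-level evaluation can be illustrated with a toy scorer; the matching rules below are simplifying assumptions for illustration, not KIEval's actual definition.

```python
def entity_f1(pred, gold):
    """Micro F1 over individual (field, value) entities, ignoring
    how entities are grouped. pred/gold: lists of groups, each a
    list of (field, value) tuples."""
    p = [e for g in pred for e in g]
    gld = [e for g in gold for e in g]
    tp, remaining = 0, list(gld)
    for e in p:
        if e in remaining:
            tp += 1
            remaining.remove(e)
    prec = tp / len(p) if p else 0.0
    rec = tp / len(gld) if gld else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def group_accuracy(pred, gold):
    """Fraction of gold groups matched exactly (order-insensitive) by
    some predicted group -- a grouping-aware view of extraction."""
    pred_sets = [frozenset(g) for g in pred]
    hits = 0
    for g in gold:
        fs = frozenset(g)
        if fs in pred_sets:
            pred_sets.remove(fs)
            hits += 1
    return hits / len(gold) if gold else 0.0
```

A model can score well on entity-level F1 while mixing up which price belongs to which line item; only the group-level score exposes that failure mode.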
☆ Linear-MoE: Linear Sequence Modeling Meets Mixture-of-Experts
Linear Sequence Modeling (LSM) techniques like linear attention, state space
models and linear RNNs, together with Mixture-of-Experts (MoE), have recently
emerged as significant architectural improvements. In this paper, we introduce
Linear-MoE, a production-level system for modeling and training large-scale
models that integrate LSM with MoE. Linear-MoE leverages the advantages of
both LSM modules for linear-complexity sequence modeling and MoE layers for
sparse activation, aiming to offer high performance with efficient training.
The Linear-MoE system comprises: 1) a Modeling subsystem, which provides a
unified framework supporting all instances of LSM; and 2) a Training
subsystem, which facilitates efficient training by incorporating various
advanced parallelism technologies, particularly Sequence Parallelism designed
for Linear-MoE models. Additionally, we explore hybrid models that combine
Linear-MoE layers with standard Transformer-MoE layers, using the same
Sequence Parallelism, to further enhance model
flexibility and performance. Evaluations on two model series, A0.3B-2B and
A1B-7B, demonstrate Linear-MoE achieves efficiency gains while maintaining
competitive performance on various benchmarks, showcasing its potential as a
next-generation foundational model architecture. Code:
https://github.com/OpenSparseLLMs/Linear-MoE.
comment: Technical report, 17 pages
☆ An Empirical Study of Conformal Prediction in LLM with ASP Scaffolds for Robust Reasoning
In this paper, we examine the use of Conformal Language Modelling (CLM)
alongside Answer Set Programming (ASP) to enhance the performance of standard
open-weight LLMs on complex multi-step reasoning tasks. Using the StepGame
dataset, which requires spatial reasoning, we apply CLM to generate sets of ASP
programs from an LLM, providing statistical guarantees on the correctness of
the outputs. Experimental results show that CLM significantly outperforms
baseline models that use standard sampling methods, achieving substantial
accuracy improvements across different levels of reasoning complexity.
Additionally, the LLM-as-Judge metric enhances CLM's performance, especially in
assessing structurally and logically correct ASP outputs. However, calibrating
CLM with diverse calibration sets did not improve generalizability for tasks
requiring much longer reasoning steps, indicating limitations in handling more
complex tasks.
☆ Multi Agent based Medical Assistant for Edge Devices
Sakharam Gawade, Shivam Akhouri, Chinmay Kulkarni, Jagdish Samant, Pragya Sahu, Aastik, Jai Pahal, Saswat Meher
Large Action Models (LAMs) have revolutionized intelligent automation, but
their application in healthcare faces challenges due to privacy concerns,
latency, and dependency on internet access. This report introduces an
on-device, multi-agent healthcare assistant that overcomes these limitations.
The system utilizes smaller, task-specific agents to optimize resource use and
ensure scalability and high performance. Our proposed system acts as a one-stop
solution for health care needs with features like appointment booking, health
monitoring, medication reminders, and daily health reporting. Powered by the
Qwen Code Instruct 2.5 7B model, the Planner and Caller Agents achieve an
average ROUGE-L score of 85.5 for planning and 96.5 for calling on our tasks
while being lightweight for on-device deployment. This innovative approach
combines the benefits of on-device systems with multi-agent architectures,
paving the way for user-centric healthcare solutions.
☆ Leveraging Semantic Type Dependencies for Clinical Named Entity Recognition
Previous work on clinical relation extraction from free-text sentences
leveraged information about semantic types from clinical knowledge bases as a
part of entity representations. In this paper, we exploit additional evidence
by also making use of domain-specific semantic type dependencies. We encode the
relation between a span of tokens matching a Unified Medical Language System
(UMLS) concept and other tokens in the sentence. We implement our method and
compare against different named entity recognition (NER) architectures (i.e.,
BiLSTM-CRF and BiLSTM-GCN-CRF) using different pre-trained clinical embeddings
(i.e., BERT, BioBERT, UMLSBert). Our experimental results on clinical datasets
show that in some cases NER effectiveness can be significantly improved by
making use of domain-specific semantic type dependencies. Our work is also the
first study generating a matrix encoding to make use of more than three
dependencies in one pass for the NER task.
☆ Shifting Perspectives: Steering Vector Ensembles for Robust Bias Mitigation in LLMs ACL 2025
We present a novel approach to bias mitigation in large language models
(LLMs) by applying steering vectors to modify model activations in forward
passes. We employ Bayesian optimization to systematically identify effective
contrastive pair datasets across nine bias axes. When optimized on the BBQ
dataset, our individually tuned steering vectors achieve average improvements
of 12.2%, 4.7%, and 3.2% over the baseline for Mistral, Llama, and Qwen,
respectively. Building on these promising results, we introduce Steering Vector
Ensembles (SVE), a method that averages multiple individually optimized
steering vectors, each targeting a specific bias axis such as age, race, or
gender. By leveraging their collective strength, SVE outperforms individual
steering vectors in both reducing bias and maintaining model performance. This
work presents the first systematic investigation of steering vectors for bias
mitigation, and we demonstrate that SVE is a powerful and computationally
efficient strategy for reducing bias in LLMs, with broader implications for
enhancing AI safety.
comment: Submitted to ACL 2025
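The ensemble step described above (averaging individually optimized steering vectors, then adding the result to activations during the forward pass) can be sketched as follows; the injection layer and strength are left open, and plain lists stand in for model tensors.

```python
def ensemble_steering_vector(vectors):
    """Average per-axis steering vectors (e.g., one each for age,
    race, gender) into a single ensemble vector (SVE)."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def steer(hidden, vector, strength=1.0):
    """Shift a layer's hidden state along the steering direction in
    the forward pass; which layer to hook and what strength to use
    are implementation choices not fixed here."""
    return [h + strength * v for h, v in zip(hidden, vector)]
```

In practice the steering would be applied via a forward hook on a chosen transformer layer, with the strength tuned to trade off bias reduction against task performance.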
☆ Chain of Strategy Optimization Makes Large Language Models Better Emotional Supporter
Weixiang Zhao, Xingyu Sui, Xinyang Han, Yang Deng, Yulin Hu, Jiahe Guo, Libo Qin, Qianyun Du, Shijin Wang, Yanyan Zhao, Bing Qin, Ting Liu
The growing emotional stress in modern society has increased the demand for
Emotional Support Conversations (ESC). While Large Language Models (LLMs) show
promise for ESC, they face two key challenges: (1) low strategy selection
accuracy, and (2) preference bias, limiting their adaptability to users'
emotional needs. Existing supervised fine-tuning (SFT) struggles to address
these issues, as it rigidly trains models on single gold-standard responses
without modeling nuanced strategy trade-offs. To overcome these limitations, we
propose Chain-of-Strategy Optimization (CSO), a novel approach that optimizes
strategy selection preferences at each dialogue turn. We first leverage Monte
Carlo Tree Search to construct ESC-Pro, a high-quality preference dataset with
turn-level strategy-response pairs. Training on ESC-Pro with CSO improves both
strategy accuracy and bias mitigation, enabling LLMs to generate more
empathetic and contextually appropriate responses. Experiments on LLaMA-3.1-8B,
Gemma-2-9B, and Qwen2.5-7B demonstrate that CSO outperforms standard SFT,
highlighting the efficacy of fine-grained, turn-level preference modeling in
ESC.
comment: 19 pages, 9 figures, 15 tables
☆ Improving Hate Speech Classification with Cross-Taxonomy Dataset Integration ACL
Algorithmic hate speech detection faces significant challenges due to the
diverse definitions and datasets used in research and practice. Social media
platforms, legal frameworks, and institutions each apply distinct yet
overlapping definitions, complicating classification efforts. This study
addresses these challenges by demonstrating that existing datasets and
taxonomies can be integrated into a unified model, enhancing prediction
performance and reducing reliance on multiple specialized classifiers. The work
introduces a universal taxonomy and a hate speech classifier capable of
detecting a wide range of definitions within a single framework. Our approach
is validated by combining two widely used but differently annotated datasets,
showing improved classification performance on an independent test set. This
work highlights the potential of dataset and taxonomy integration in advancing
hate speech detection, increasing efficiency, and ensuring broader
applicability across contexts.
comment: Accepted for publication at LaTeCH-CLfL 2025. The 9th Joint ACL
Special Interest Group on Language Technologies for the Socio-Economic
Sciences and Humanities (SIGHUM) Workshop on Computational Linguistics for
Cultural Heritage, Social Sciences, Humanities and Literature
☆ GEMA-Score: Granular Explainable Multi-Agent Score for Radiology Report Evaluation
Zhenxuan Zhang, Kinhei Lee, Weihang Deng, Huichi Zhou, Zihao Jin, Jiahao Huang, Zhifan Gao, Dominic C Marshall, Yingying Fang, Guang Yang
Automatic medical report generation supports clinical diagnosis, reduces the
workload of radiologists, and holds the promise of improving diagnosis
consistency. However, existing evaluation metrics primarily assess the accuracy
of key medical information coverage in generated reports compared to
human-written reports, while overlooking crucial details such as the location
and certainty of reported abnormalities. These limitations hinder the
comprehensive assessment of the reliability of generated reports and pose risks
in their selection for clinical use. Therefore, we propose a Granular
Explainable Multi-Agent Score (GEMA-Score) in this paper, which conducts both
objective quantification and subjective evaluation through a large language
model-based multi-agent workflow. Our GEMA-Score parses structured reports and
employs NER-F1 calculations through interactive exchanges of information among
agents to assess disease diagnosis, location, severity, and uncertainty.
Additionally, an LLM-based scoring agent evaluates completeness, readability,
and clinical terminology while providing explanatory feedback. Extensive
experiments validate that GEMA-Score achieves the highest correlation with
human expert evaluations on a public dataset, demonstrating its effectiveness
in clinical scoring (Kendall coefficient = 0.70 for Rexval dataset and Kendall
coefficient = 0.54 for RadEvalX dataset). The anonymous project demo is
available at: https://github.com/Zhenxuan-Zhang/GEMA_score.
☆ AutoIOT: LLM-Driven Automated Natural Language Programming for AIoT Applications
The advent of Large Language Models (LLMs) has profoundly transformed our
lives, revolutionizing interactions with AI and lowering the barrier to AI
usage. While LLMs are primarily designed for natural language interaction, the
extensive embedded knowledge empowers them to comprehend digital sensor data.
This capability enables LLMs to engage with the physical world through IoT
sensors and actuators, performing a myriad of AIoT tasks. Consequently, this
evolution triggers a paradigm shift in conventional AIoT application
development, democratizing its accessibility to all by facilitating the design
and development of AIoT applications via natural language. However, some
limitations need to be addressed to unlock the full potential of LLMs in AIoT
application development. First, existing solutions often require transferring
raw sensor data to LLM servers, which raises privacy concerns, incurs high
query fees, and is limited by token size. Moreover, the reasoning processes of
LLMs are opaque to users, making it difficult to verify the robustness and
correctness of inference results. This paper introduces AutoIOT, an LLM-based
automated program generator for AIoT applications. AutoIOT enables users to
specify their requirements using natural language (input) and automatically
synthesizes interpretable programs with documentation (output). AutoIOT
automates the iterative optimization to enhance the quality of generated code
with minimum user involvement. AutoIOT not only makes the execution of AIoT
tasks more explainable but also mitigates privacy concerns and reduces token
costs with local execution of synthesized programs. Extensive experiments and
user studies demonstrate AutoIOT's remarkable capability in program synthesis
for various AIoT tasks. The synthesized programs can match and even outperform
some representative baselines.
☆ Speculative Decoding for Multi-Sample Inference
Yiwei Li, Jiayi Shi, Shaoxiong Feng, Peiwen Yuan, Xinglin Wang, Yueqi Zhang, Ji Zhang, Chuyi Tan, Boyuan Pan, Yao Hu, Kan Li
We propose a novel speculative decoding method tailored for multi-sample
reasoning scenarios, such as self-consistency and Best-of-N sampling. Our
method exploits the intrinsic consensus of parallel generation paths to
synthesize high-quality draft tokens without requiring auxiliary models or
external databases. By dynamically analyzing structural patterns across
parallel reasoning paths through a probabilistic aggregation mechanism, it
identifies consensus token sequences that align with the decoding distribution.
Evaluations on mathematical reasoning benchmarks demonstrate a substantial
improvement in draft acceptance rates over baselines, while reducing the
latency in draft token construction. This work establishes a paradigm shift for
efficient multi-sample inference, enabling seamless integration of speculative
decoding with sampling-based reasoning techniques.
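The paper aggregates parallel paths through a probabilistic mechanism; as a simplified stand-in, the sketch below builds a draft by position-wise majority vote over parallel generations, stopping once agreement drops below a threshold.

```python
from collections import Counter

def consensus_draft(paths, max_len=8, min_agree=0.5):
    """Build draft tokens from parallel generation paths: at each
    position, take the majority token if enough paths agree, and stop
    extending the draft otherwise. A crude proxy for the paper's
    probabilistic aggregation over structural patterns."""
    draft = []
    for pos in range(max_len):
        tokens = [p[pos] for p in paths if len(p) > pos]
        if not tokens:
            break
        token, count = Counter(tokens).most_common(1)[0]
        if count / len(paths) < min_agree:
            break  # consensus too weak to be a useful draft
        draft.append(token)
    return draft
```

The draft is then verified by the target model as in standard speculative decoding; because the draft comes from the model's own parallel samples, no auxiliary draft model is needed.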
☆ Dynamic Knowledge Integration for Evidence-Driven Counter-Argument Generation with Large Language Models
This paper investigates the role of dynamic external knowledge integration in
improving counter-argument generation using Large Language Models (LLMs). While
LLMs have shown promise in argumentative tasks, their tendency to generate
lengthy, potentially unfactual responses highlights the need for more
controlled and evidence-based approaches. We introduce a new manually curated
dataset of argument and counter-argument pairs specifically designed to balance
argumentative complexity with evaluative feasibility. We also propose a new
LLM-as-a-Judge evaluation methodology that shows a stronger correlation with
human judgments compared to traditional reference-based metrics. Our
experimental results demonstrate that integrating dynamic external knowledge
from the web significantly improves the quality of generated counter-arguments,
particularly in terms of relatedness, persuasiveness, and factuality. The
findings suggest that combining LLMs with real-time external knowledge
retrieval offers a promising direction for developing more effective and
reliable counter-argumentation systems.
☆ Fine-Grained Evaluation for Implicit Discourse Relation Recognition
Implicit discourse relation recognition is a challenging task in discourse
analysis due to the absence of explicit discourse connectives between spans of
text. Recent pre-trained language models have achieved great success on this
task. However, there has been no fine-grained analysis of how these
pre-trained language models perform on it, so the difficulty and possible
directions of this task remain unclear. In this paper, we analyze the model
predictions in depth, attempting to identify what pre-trained language models
find difficult and where this task may go next. Beyond this in-depth analysis,
we semi-manually annotate data to supply relatively high-quality examples for
the relations with few annotated instances in PDTB 3.0. The annotated data
significantly improves implicit discourse relation recognition for level-2
senses.
☆ Uncertainty-Aware Decoding with Minimum Bayes Risk ICLR 2025
Despite their outstanding performance in the majority of scenarios,
contemporary language models still occasionally generate undesirable outputs,
for example, hallucinated text. While such behaviors have previously been
linked to uncertainty, there is a notable lack of methods that actively
consider uncertainty during text generation. In this work, we show how Minimum
Bayes Risk (MBR) decoding, which selects model generations according to an
expected risk, can be generalized into a principled uncertainty-aware decoding
method. In short, we account for model uncertainty during decoding by
incorporating a posterior over model parameters into MBR's computation of
expected risk. We show that this modified expected risk is useful for both
choosing outputs and deciding when to abstain from generation and can provide
improvements without incurring overhead. We benchmark different methods for
learning posteriors and show that performance improves with prediction
diversity. We release our code publicly.
comment: ICLR 2025 (Poster)
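The generalization described above can be sketched in a few lines. Here the posterior over parameters is approximated by a handful of model samples, each contributing pseudo-references; `mbr_select` and the toy Jaccard utility are illustrative names of my own, not the paper's API:

```python
def mbr_select(candidates, references_per_model, utility):
    """Uncertainty-aware MBR sketch: expected utility is averaged over
    pseudo-references drawn from several posterior model samples."""
    def expected_utility(cand):
        total, n = 0.0, 0
        for refs in references_per_model:  # one list per posterior sample
            for ref in refs:
                total += utility(cand, ref)
                n += 1
        return total / n
    # Select the candidate with the highest expected utility
    # (equivalently, the lowest expected risk).
    scored = [(expected_utility(c), c) for c in candidates]
    return max(scored)[1]

# Toy utility: Jaccard overlap between token sets.
def jaccard(a, b):
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

cands = ["the cat sat", "a dog ran"]
refs = [["the cat sat down", "the cat slept"], ["the cat sat"]]
print(mbr_select(cands, refs, jaccard))
```

A low maximum expected utility across all candidates could likewise serve as the abstention signal the abstract mentions.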
☆ Coreference as an indicator of context scope in multimodal narrative
We demonstrate that large multimodal language models differ substantially
from humans in the distribution of coreferential expressions in a visual
storytelling task. We introduce a number of metrics to quantify the
characteristics of coreferential patterns in both human- and machine-written
texts. Humans distribute coreferential expressions in a way that maintains
consistency across texts and images, interleaving references to different
entities in a highly varied way. Machines are less able to track mixed
references, despite achieving perceived improvements in generation quality.
comment: 20 pages, 4 tables
☆ Similarity-Based Domain Adaptation with LLMs
Unsupervised domain adaptation leverages abundant labeled data from various
source domains to generalize onto unlabeled target data. Prior research has
primarily focused on learning domain-invariant features across the source and
target domains. However, these methods often require training a model using
source domain data, which is time-consuming and can limit model usage for
applications with different source data. This paper introduces a simple
framework that utilizes the impressive generalization capabilities of Large
Language Models (LLMs) for target data annotation without the need of source
model training, followed by a novel similarity-based knowledge distillation
loss. Our extensive experiments on cross-domain text classification reveal that
our framework achieves impressive performance, specifically a 2.44\% accuracy
improvement over the SOTA method.
☆ Revealing Hidden Mechanisms of Cross-Country Content Moderation with Natural Language Processing
The ability of Natural Language Processing (NLP) methods to categorize text
into multiple classes has motivated their use in online content moderation
tasks, such as hate speech and fake news detection. However, there is limited
understanding of how or why these methods make such decisions, or why certain
content is moderated in the first place. To investigate the hidden mechanisms
behind content moderation, we explore multiple directions: 1) training
classifiers to reverse-engineer content moderation decisions across countries;
2) explaining content moderation decisions by analyzing Shapley values and
LLM-guided explanations. Our primary focus is on content moderation decisions
made across countries, using pre-existing corpora sampled from the Twitter
Stream Grab. Our experiments reveal interesting patterns in censored posts,
both across countries and over time. Through human evaluations of LLM-generated
explanations across three LLMs, we assess the effectiveness of using LLMs in
content moderation. Finally, we discuss potential future directions, as well as
the limitations and ethical considerations of this work. Our code and data are
available at https://github.com/causalNLP/censorship
☆ ZOGRASCOPE: A New Benchmark for Property Graphs
Natural language interfaces to knowledge graphs have become increasingly
important in recent years, enabling easy and efficient access to structured
data. Property graphs in particular have seen growing adoption, yet this kind
of graph remains relatively underrepresented in research, which has focused
largely on RDF-style graphs. In fact, there is a lack of resources for
evaluating systems on property graphs, with many existing datasets featuring
relatively simple queries. To address this gap, we introduce ZOGRASCOPE, a
benchmark designed specifically for the Cypher query language.
The benchmark includes a diverse set of manually annotated queries of varying
complexity. We complement this paper with a set of experiments that test the
performance of out-of-the-box LLMs of different sizes. Our experiments show
that semantic parsing over graphs is still a challenging open problem that
cannot be solved by prompting LLMs alone.
☆ PhiloBERTA: A Transformer-Based Cross-Lingual Analysis of Greek and Latin Lexicons
We present PhiloBERTA, a cross-lingual transformer model that measures
semantic relationships between ancient Greek and Latin lexicons. Through
analysis of selected term pairs from classical texts, we use contextual
embeddings and angular similarity metrics to identify precise semantic
alignments. Our results show that etymologically related pairs demonstrate
significantly higher similarity scores, particularly for abstract philosophical
concepts such as epistēmē (scientia) and dikaiosynē (iustitia).
Statistical analysis reveals consistent patterns in these relationships (p =
0.012), with etymologically related pairs showing remarkably stable semantic
preservation compared to control pairs. These findings establish a quantitative
framework for examining how philosophical concepts moved between Greek and
Latin traditions, offering new methods for classical philological research.
☆ WritingBench: A Comprehensive Benchmark for Generative Writing
Yuning Wu, Jiahao Mei, Ming Yan, Chenliang Li, Shaopeng Lai, Yuran Ren, Zijia Wang, Ji Zhang, Mengyue Wu, Qin Jin, Fei Huang
Recent advancements in large language models (LLMs) have significantly
enhanced text generation capabilities, yet evaluating their performance in
generative writing remains a challenge. Existing benchmarks primarily focus on
generic text generation or are limited to specific writing tasks, failing to
capture the diverse requirements of high-quality written content across various
domains.
To bridge this gap, we present WritingBench, a comprehensive benchmark designed
to evaluate LLMs across 6 core writing domains and 100 subdomains, encompassing
creative, persuasive, informative, and technical writing. We further propose a
query-dependent evaluation framework that empowers LLMs to dynamically generate
instance-specific assessment criteria. This framework is complemented by a
fine-tuned critic model for criteria-aware scoring, enabling evaluations in
style, format and length. The framework's validity is further demonstrated by
its data curation capability, which enables 7B-parameter models to approach
state-of-the-art (SOTA) performance. We open-source the benchmark, along with
evaluation tools and modular framework components, to advance the development
of LLMs in writing.
☆ MM-StoryAgent: Immersive Narrated Storybook Video Generation with a Multi-Agent Paradigm across Text, Image and Audio
The rapid advancement of large language models (LLMs) and artificial
intelligence-generated content (AIGC) has accelerated AI-native applications,
such as AI-based storybooks that automate engaging story production for
children. However, challenges remain in improving story attractiveness,
enriching storytelling expressiveness, and developing open-source evaluation
benchmarks and frameworks. Therefore, we propose and open-source MM-StoryAgent,
which creates immersive narrated video storybooks with refined plots,
role-consistent images, and multi-channel audio. MM-StoryAgent designs a
multi-agent framework that employs LLMs and diverse expert tools (generative
models and APIs) across several modalities to produce expressive storytelling
videos. The framework enhances story attractiveness through a multi-stage
writing pipeline. In addition, it improves the immersive storytelling
experience by integrating sound effects with visual, music and narrative
assets. MM-StoryAgent offers a flexible, open-source platform for further
development, where generative modules can be substituted. Both objective and
subjective evaluations of textual story quality and cross-modal alignment
validate the effectiveness of our proposed MM-StoryAgent system. The demo and
source code are available.
☆ Personalized Text Generation with Contrastive Activation Steering
Personalized text generation aims to infer users' writing style preferences
from their historical texts and generate outputs that faithfully reflect these
stylistic characteristics. Existing solutions primarily adopt two paradigms:
retrieval-augmented generation (RAG) and parameter-efficient fine-tuning
(PEFT). While these approaches have advanced the field, they suffer from two
critical limitations: (1) the entanglement of content semantics and stylistic
patterns in historical texts impedes accurate modeling of user-specific writing
preferences; and (2) scalability challenges arising from both the inference
latency introduced by RAG's retrieval operations and PEFT's per-user parameter
storage requirements. To overcome these limitations, we propose StyleVector, a
training-free framework that disentangles and represents personalized writing
style as a vector in LLM's activation space, enabling style-steered generation
during inference without requiring costly retrieval or parameter storage.
Comprehensive experiments demonstrate that our framework achieves a significant
8% relative improvement in personalized generation while reducing storage
requirements by 1700 times compared to PEFT methods.
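The activation-space mechanism can be sketched compactly: a style direction is taken as a difference of mean activations, then added to hidden states at inference time. The difference-of-means construction and the names below are my illustrative reading of the abstract, not the paper's exact recipe:

```python
import numpy as np

def style_vector(user_acts, neutral_acts):
    """Derive a style direction in activation space as the difference of
    means (illustrative sketch of a contrastive steering vector)."""
    return np.mean(user_acts, axis=0) - np.mean(neutral_acts, axis=0)

def steer(hidden, v, alpha=1.0):
    """Shift a hidden state along the style direction during generation."""
    return hidden + alpha * v

rng = np.random.default_rng(0)
user = rng.normal(1.0, 0.1, size=(8, 4))     # activations on user texts
neutral = rng.normal(0.0, 0.1, size=(8, 4))  # activations on neutral texts
v = style_vector(user, neutral)
h = np.zeros(4)
print(steer(h, v, alpha=0.5))
```

Because only one vector per user is stored and no retrieval runs at inference time, this is where the storage and latency savings over RAG and PEFT come from.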
☆ Knowledge Updating? No More Model Editing! Just Selective Contextual Reasoning
As real-world knowledge evolves, the information embedded within large
language models (LLMs) can become outdated, inadequate, or erroneous. Model
editing has emerged as a prominent approach for updating LLMs' knowledge with
minimal computational costs and parameter changes. This approach typically
identifies and adjusts specific model parameters associated with newly acquired
knowledge. However, existing methods often underestimate the adverse effects
that parameter modifications can have on broadly distributed knowledge. More
critically, post-edit LLMs frequently struggle with multi-hop reasoning and
continuous knowledge updates. Although various studies have discussed these
shortcomings, there is a lack of comprehensive evaluation. In this paper, we
provide an evaluation of ten model editing methods along four dimensions:
reliability, generalization, locality, and portability. Results confirm that
all ten popular model editing methods show significant shortcomings across
multiple dimensions, suggesting that model editing is less promising than
commonly assumed. We then propose a straightforward method, Selective
Contextual Reasoning (SCR), for knowledge updating. SCR does not modify model
parameters but harnesses the LLM's inherent contextual reasoning capabilities
over the updated knowledge. Under SCR, an LLM first assesses whether an incoming query
falls within the scope of an external knowledge base. If it does, the relevant
external knowledge texts are contextualized to enhance reasoning; otherwise,
the query is answered directly. We evaluate SCR against the ten model editing
methods on two counterfactual datasets with three backbone LLMs. Empirical
results confirm the effectiveness and efficiency of contextual reasoning for
knowledge updating.
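The SCR routing step described above reduces to a simple scope check followed by optional context injection. The toy components below (`in_scope`, `retriever`, `llm`) are stand-ins I introduce to make the sketch runnable, not the paper's implementation:

```python
def scr_answer(query, knowledge_base, llm, retriever, in_scope):
    """Selective Contextual Reasoning sketch: route a query through the
    external knowledge base only when it falls within its scope."""
    if in_scope(query, knowledge_base):
        facts = retriever(query, knowledge_base)
        prompt = "Facts:\n" + "\n".join(facts) + "\nQuestion: " + query
    else:
        prompt = query  # answer directly; no parameters are modified
    return llm(prompt)

# Toy components for a runnable demo.
kb = {"capital of france": "Paris is the capital of France."}
in_scope = lambda q, kb: q.lower() in kb
retriever = lambda q, kb: [kb[q.lower()]]
llm = lambda p: p.splitlines()[1] if p.startswith("Facts:") else "direct"

print(scr_answer("capital of France", kb, llm, retriever, in_scope))
print(scr_answer("2+2?", kb, llm, retriever, in_scope))
```

The scope check is what distinguishes SCR from plain RAG: out-of-scope queries bypass retrieval entirely, leaving unrelated knowledge untouched.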
☆ Path Pooling: Train-Free Structure Enhancement for Efficient Knowledge Graph Retrieval-Augmented Generation
Although Large Language Models achieve strong success in many tasks, they
still suffer from hallucinations and knowledge deficiencies in real-world
applications. Many knowledge graph-based retrieval-augmented generation
(KG-RAG) methods enhance the quality and credibility of LLMs by leveraging
structure and semantic information in KGs as external knowledge bases. However,
these methods struggle to effectively incorporate structure information, either
incurring high computational costs or underutilizing available knowledge.
Inspired by smoothing operations in graph representation learning, we propose
path pooling, a simple, train-free strategy that introduces structure
information through a novel path-centric pooling operation. It seamlessly
integrates into existing KG-RAG methods in a plug-and-play manner, enabling
richer structure information utilization. Extensive experiments demonstrate
that incorporating path pooling into the state-of-the-art KG-RAG method
consistently improves performance across various settings while introducing
negligible additional cost. Code is coming soon at
https://github.com/hrwang00/path-pooling.
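One way to read the path-centric pooling idea is as a smoothing pass over node embeddings along each retrieved path. The blending coefficient and the dictionary-based `path_pool` helper below are my own illustrative choices; the paper's exact operator may differ:

```python
import numpy as np

def path_pool(node_embeddings, paths):
    """Path-centric pooling sketch: smooth each node's embedding toward
    the mean of the path it appears on, injecting structure information
    without any training."""
    pooled = {}
    for path in paths:
        mean = np.mean([node_embeddings[n] for n in path], axis=0)
        for n in path:
            # Blend the node's own embedding with its path mean.
            pooled[n] = 0.5 * node_embeddings[n] + 0.5 * mean
    return pooled

emb = {
    "a": np.array([1.0, 0.0]),
    "b": np.array([0.0, 1.0]),
    "c": np.array([1.0, 1.0]),
}
out = path_pool(emb, [["a", "b", "c"]])
print(out["a"])
```

Because the operation is a fixed averaging step, it adds no trainable parameters and can be dropped in front of any existing KG-RAG retriever.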
☆ ORANSight-2.0: Foundational LLMs for O-RAN
Despite the transformative impact of Large Language Models (LLMs) across
critical domains such as healthcare, customer service, and business marketing,
their integration into Open Radio Access Networks (O-RAN) remains limited. This
gap is primarily due to the absence of domain-specific foundational models,
with existing solutions often relying on general-purpose LLMs that fail to
address the unique challenges and technical intricacies of O-RAN. To bridge
this gap, we introduce ORANSight-2.0 (O-RAN Insights), a pioneering initiative
aimed at developing specialized foundational LLMs tailored for O-RAN. Built on
18 LLMs spanning five open-source LLM frameworks, ORANSight-2.0 fine-tunes
models ranging from 1B to 70B parameters, significantly reducing reliance on
proprietary, closed-source models while enhancing performance for O-RAN. At the
core of ORANSight-2.0 is RANSTRUCT, a novel Retrieval-Augmented Generation
(RAG) based instruction-tuning framework that employs two LLM agents to create
high-quality instruction-tuning datasets. The generated dataset is then used to
fine-tune the 18 pre-trained open-source LLMs via QLoRA. To evaluate
ORANSight-2.0, we introduce srsRANBench, a novel benchmark designed for code
generation and codebase understanding in the context of srsRAN, a widely used
5G O-RAN stack. We also leverage ORANBench13K, an existing benchmark for
assessing O-RAN-specific knowledge. Our comprehensive evaluations demonstrate
that ORANSight-2.0 models outperform general-purpose and closed-source models,
such as ChatGPT-4o and Gemini, by 5.421% on ORANBench and 18.465% on
srsRANBench, achieving superior performance while maintaining lower
computational and energy costs. We also experiment with RAG-augmented variants
of ORANSight-2.0 LLMs and thoroughly evaluate their energy characteristics,
demonstrating costs for training, standard inference, and RAG-augmented
inference.
☆ Memory-augmented Query Reconstruction for LLM-based Knowledge Graph Reasoning
Large language models (LLMs) have achieved remarkable performance on
knowledge graph question answering (KGQA) tasks by planning and interacting
with knowledge graphs. However, existing methods often conflate tool
utilization with knowledge reasoning, harming the readability of model outputs
and giving rise to hallucinatory tool invocations, which hinder the advancement of KGQA. To
address this issue, we propose Memory-augmented Query Reconstruction for
LLM-based Knowledge Graph Reasoning (MemQ) to decouple LLM from tool invocation
tasks using LLM-built query memory. By establishing a memory module with
explicit descriptions of query statements, the proposed MemQ facilitates the
KGQA process with natural language reasoning and memory-augmented query
reconstruction. Meanwhile, we design an effective and readable reasoning
strategy to enhance the LLM's reasoning capability in KGQA. Experimental
results show that MemQ achieves state-of-the-art performance on the widely used
benchmarks WebQSP and CWQ.
☆ Rewarding Curse: Analyze and Mitigate Reward Modeling Issues for LLM Reasoning
Chain-of-thought (CoT) prompting demonstrates varying performance under
different reasoning tasks. Previous work attempts to evaluate it but falls
short in providing an in-depth analysis of patterns that influence the CoT. In
this paper, we study the CoT performance from the perspective of effectiveness
and faithfulness. For the former, we identify key factors that influence CoT
effectiveness on performance improvement, including problem difficulty,
information gain, and information flow. For the latter, we interpret the
unfaithful CoT issue by conducting a joint analysis of the information
interaction among the question, the CoT, and the answer. The results
demonstrate that, when the LLM predicts answers, it can recall correct
information that is missing from the CoT directly from the question, leading to
the unfaithfulness issue. Finally, we propose a novel
algorithm to mitigate this issue, in which we recall extra information from the
question to enhance the CoT generation and evaluate CoTs based on their
information gain. Extensive experiments demonstrate that our approach enhances
both the faithfulness and effectiveness of CoT.
comment: 18 pages, 21 figures
☆ Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching
Recent advances in large language models have demonstrated remarkable
reasoning capabilities through Chain of Thought (CoT) prompting, but often at
the cost of excessive verbosity in their intermediate outputs, which increases
computational overhead. We introduce Sketch-of-Thought (SoT), a novel prompting
framework that combines cognitive-inspired reasoning paradigms with linguistic
constraints to minimize token usage while preserving reasoning accuracy. SoT is
designed as a flexible framework that can incorporate any custom reasoning
paradigms based on cognitive science, and we instantiate it with three such
paradigms - Conceptual Chaining, Chunked Symbolism, and Expert Lexicons - each
tailored to different reasoning tasks and selected dynamically via a
lightweight routing model. Through comprehensive evaluation across 15 reasoning
datasets with multiple languages and multimodal scenarios, we demonstrate that
SoT achieves token reductions of 76% with negligible accuracy impact. In
certain domains like mathematical and multi-hop reasoning, it even improves
accuracy while using significantly fewer tokens. Our code is publicly
available: https://www.github.com/SimonAytes/SoT.
☆ Ensemble Debiasing Across Class and Sample Levels for Fairer Prompting Accuracy
Language models are strong few-shot learners and achieve good overall
accuracy in text classification tasks, masking the fact that their results
suffer from severe class accuracy imbalance. We believe that gains in overall
accuracy should come not from further enriching the strong classes but from
raising the weak ones. To address the imbalance, we propose a post-hoc
nonlinear integer programming based debiasing method that ensembles weight
correction and membership correction to enable flexible rectifications of class
probabilities at both class and sample levels, enhancing the performance of
LLMs directly from their outputs. Evaluations with Llama-2-13B on seven text
classification benchmarks show that our approach achieves state-of-the-art
overall accuracy gains with balanced class accuracies. The resulting probability
correction scheme demonstrates that sample-level corrections are necessary to
elevate weak classes. In addition, by effectively correcting weak classes,
our method also brings significant performance gains to Llama-2-70B, especially
on a biomedical domain task, demonstrating its effectiveness across both small
and large model variants.
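The class-level half of the correction can be illustrated with a deliberately simplified sketch: rescale each class by the inverse of its mean predicted probability so that under-predicted classes are raised. The paper actually solves a nonlinear integer program that additionally applies sample-level membership corrections; nothing below is its exact formulation:

```python
import numpy as np

def class_weight_correction(probs):
    """Class-level weight correction sketch: boost classes the model
    systematically under-predicts, then renormalize each row."""
    w = 1.0 / probs.mean(axis=0)          # inverse mean probability per class
    corrected = probs * w
    return corrected / corrected.sum(axis=1, keepdims=True)

# Biased toy outputs: class 0 dominates every prediction.
probs = np.array([[0.70, 0.30],
                  [0.60, 0.40],
                  [0.80, 0.20],
                  [0.55, 0.45]])
fixed = class_weight_correction(probs)
print(fixed.argmax(axis=1))  # some samples now flip to the weak class
```

Even this crude rescaling shows why sample-level corrections matter: a single global weight flips only the borderline samples, while confidently misclassified ones need per-sample membership adjustments.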
☆ Interpersonal Memory Matters: A New Task for Proactive Dialogue Utilizing Conversational History
Proactive dialogue systems aim to empower chatbots with the capability of
leading conversations towards specific targets, thereby enhancing user
engagement and service autonomy. Existing systems typically target pre-defined
keywords or entities, neglecting user attributes and preferences implicit in
dialogue history, hindering the development of long-term user intimacy. To
address these challenges, we take a radical step towards building a more
human-like conversational agent by integrating proactive dialogue systems with
long-term memory into a unified framework. Specifically, we define a novel task
named Memory-aware Proactive Dialogue (MapDia). By decomposing the task, we
then propose an automatic data construction method and create the first Chinese
Memory-aware Proactive Dataset (ChMapData). Furthermore, we introduce a joint
framework based on Retrieval Augmented Generation (RAG), featuring three
modules: Topic Summarization, Topic Retrieval, and Proactive Topic-shifting
Detection and Generation, designed to steer dialogues towards relevant
historical topics at the right time. The effectiveness of our dataset and
models is validated through both automatic and human evaluations. We release
the open-source framework and dataset at
https://github.com/FrontierLabs/MapDia.
☆ RocketEval: Efficient Automated LLM Evaluation via Grading Checklist ICLR 2025
Evaluating large language models (LLMs) in diverse and challenging scenarios
is essential to align them with human preferences. To mitigate the prohibitive
costs associated with human evaluations, utilizing a powerful LLM as a judge
has emerged as a favored approach. Nevertheless, this methodology encounters
several challenges, including substantial expenses, concerns regarding privacy
and security, and reproducibility. In this paper, we propose a straightforward,
replicable, and accurate automated evaluation method by leveraging a
lightweight LLM as the judge, named RocketEval. Initially, we identify that the
performance disparity between lightweight and powerful LLMs in evaluation tasks
primarily stems from their ability to conduct comprehensive analyses, which is
not easily enhanced through techniques such as chain-of-thought reasoning. By
reframing the evaluation task as a multi-faceted Q&A using an instance-specific
checklist, we demonstrate that the limited judgment accuracy of lightweight
LLMs is largely attributable to high uncertainty and positional bias. To address
these challenges, we introduce an automated evaluation process grounded in
checklist grading, which is designed to accommodate a variety of scenarios and
questions. This process encompasses the creation of checklists, the grading of
these checklists by lightweight LLMs, and the reweighting of checklist items to
align with the supervised annotations. Our experiments on the automated
evaluation benchmarks MT-Bench and WildBench reveal that
RocketEval, when using Gemma-2-2B as the judge, achieves a high correlation
(0.965) with human preferences, which is comparable to GPT-4o. Moreover,
RocketEval provides a cost reduction exceeding 50-fold for large-scale
evaluation and comparison scenarios. Our code is available at
https://github.com/Joinn99/RocketEval-ICLR .
comment: Accepted by ICLR 2025: https://openreview.net/forum?id=zJjzNj6QUe
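The checklist-grading step reduces to a reweighted yes/no tally. The weights below are illustrative placeholders; in RocketEval they are learned by reweighting checklist items against supervised annotations:

```python
def checklist_score(judgments, weights=None):
    """Checklist-grading sketch: a lightweight judge answers each
    checklist item yes/no, and a weighted fraction of passed items
    becomes the response's score."""
    if weights is None:
        weights = [1.0] * len(judgments)
    total = sum(weights)
    passed = sum(w for j, w in zip(judgments, weights) if j)
    return passed / total

checklist = [
    "Does the answer address the question?",
    "Is the answer factually consistent?",
    "Is the answer well organized?",
]
judgments = [True, True, False]  # one yes/no verdict per checklist item
print(checklist_score(judgments, weights=[2.0, 2.0, 1.0]))  # 0.8
```

Turning one open-ended judgment into many narrow yes/no questions is precisely what reduces the uncertainty and positional bias of a small judge like Gemma-2-2B.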
☆ Every FLOP Counts: Scaling a 300B Mixture-of-Experts LING LLM without Premium GPUs
Ling Team, Binwei Zeng, Chao Huang, Chao Zhang, Changxin Tian, Cong Chen, Dingnan Jin, Feng Yu, Feng Zhu, Feng Yuan, Fakang Wang, Gangshan Wang, Guangyao Zhai, Haitao Zhang, Huizhong Li, Jun Zhou, Jia Liu, Junpeng Fang, Junjie Ou, Jun Hu, Ji Luo, Ji Zhang, Jian Liu, Jian Sha, Jianxue Qian, Jiewei Wu, Junping Zhao, Jianguo Li, Jubao Feng, Jingchao Di, Junming Xu, Jinghua Yao, Kuan Xu, Kewei Du, Longfei Li, Lei Liang, Lu Yu, Li Tang, Lin Ju, Peng Xu, Qing Cui, Song Liu, Shicheng Li, Shun Song, Song Yan, Tengwei Cai, Tianyi Chen, Ting Guo, Ting Huang, Tao Feng, Tao Wu, Wei Wu, Xiaolu Zhang, Xueming Yang, Xin Zhao, Xiaobo Hu, Xin Lin, Yao Zhao, Yilong Wang, Yongzhen Guo, Yuanyuan Wang, Yue Yang, Yang Cao, Yuhao Fu, Yi Xiong, Yanzhe Li, Zhe Li, Zhiqiang Zhang, Ziqi Liu, Zhaoxin Huan, Zujie Wen, Zhenhang Sun, Zhuoxuan Du, Zhengyu He
In this technical report, we tackle the challenges of training large-scale
Mixture of Experts (MoE) models, focusing on overcoming cost inefficiency and
resource limitations prevalent in such systems. To address these issues, we
present two differently sized MoE large language models (LLMs), namely
Ling-Lite and Ling-Plus (referred to as "Bailing" in Chinese, spelled
Bǎilíng in Pinyin). Ling-Lite contains 16.8 billion parameters with 2.75
billion activated parameters, while Ling-Plus boasts 290 billion parameters
with 28.8 billion activated parameters. Both models exhibit comparable
performance to leading industry benchmarks. This report offers actionable
insights to improve the efficiency and accessibility of AI development in
resource-constrained settings, promoting more scalable and sustainable
technologies. Specifically, to reduce training costs for large-scale MoE
models, we propose innovative methods for (1) optimization of model
architecture and training processes, (2) refinement of training anomaly
handling, and (3) enhancement of model evaluation efficiency. Additionally,
leveraging high-quality data generated from knowledge graphs, our models
demonstrate superior capabilities in tool use compared to other models.
Ultimately, our experimental findings demonstrate that a 300B MoE LLM can be
effectively trained on lower-performance devices while achieving comparable
performance to models of a similar scale, including dense and MoE models.
Compared to high-performance devices, utilizing a lower-specification hardware
system during the pre-training phase demonstrates significant cost savings,
reducing computing costs by approximately 20%. The models can be accessed at
https://huggingface.co/inclusionAI.
comment: 34 pages
☆ AutoTestForge: A Multidimensional Automated Testing Framework for Natural Language Processing Models
In recent years, the application of behavioral testing to Natural Language
Processing (NLP) model evaluation has grown substantially. However, existing
methods remain restricted by their reliance on manual labor and the limited
scope of their capability assessment. To address these limitations, we
introduce AutoTestForge, an automated and multidimensional testing framework
for NLP models. AutoTestForge uses Large Language Models (LLMs) to
automatically generate and instantiate test templates, significantly reducing
manual involvement. Additionally, it validates test case labels through
differential testing with a multi-model voting system, guaranteeing the quality
of test cases. The framework also extends the test suite across three
dimensions: taxonomy, fairness, and robustness, offering a comprehensive
evaluation of the capabilities of NLP models. This expansion enables a more
in-depth and thorough assessment of the models, providing valuable insights
into their strengths and weaknesses. A
comprehensive evaluation across sentiment analysis (SA) and semantic textual
similarity (STS) tasks demonstrates that AutoTestForge consistently outperforms
existing datasets and testing tools, achieving higher error detection rates (an
average of $30.89\%$ for SA and $34.58\%$ for STS). Moreover, different
generation strategies exhibit stable effectiveness, with error detection rates
ranging from $29.03\% - 36.82\%$.
comment: 15 pages, 4 figures, Under review
☆ SpecServe: Efficient and SLO-Aware Large Language Model Serving with Adaptive Speculative Decoding
Large Language Model (LLM) services often face challenges in achieving low
inference latency and meeting Service Level Objectives (SLOs) under dynamic
request patterns. Speculative decoding, which exploits lightweight models for
drafting and LLMs for verification, has emerged as a compelling technique to
accelerate LLM inference. However, existing speculative decoding solutions
often fail to adapt to varying workloads and system environments, resulting in
performance variability and SLO violations. In this paper, we introduce
SpecServe, an efficient LLM inference system that dynamically adjusts
speculative strategies according to real-time request loads and system
configurations. SpecServe proposes a theoretical model to understand and
predict the efficiency of speculative decoding across diverse scenarios.
Additionally, it implements intelligent drafting and verification algorithms to
guarantee optimal performance while achieving high SLO attainment. Experimental
results on real-world LLM traces demonstrate that SpecServe consistently meets
SLOs and achieves substantial performance improvements, yielding
1.14$\times$-14.3$\times$ speedups over state-of-the-art speculative inference
systems.
☆ S2S-Arena, Evaluating Speech2Speech Protocols on Instruction Following with Paralinguistic Information
The rapid development of large language models (LLMs) has brought significant
attention to speech models, particularly recent progress in speech2speech
protocols supporting speech input and output. However, the existing benchmarks
adopt automatic text-based evaluators for evaluating the instruction following
ability of these models lack consideration for paralinguistic information in
both speech understanding and generation. To address these issues, we introduce
S2S-Arena, a novel arena-style S2S benchmark that evaluates
instruction-following capabilities with paralinguistic information in both
speech-in and speech-out across real-world tasks. We design 154 samples that
fused TTS and live recordings in four domains with 21 tasks and manually
evaluate existing popular speech models in an arena-style manner. The
experimental results show that: (1) in addition to the superior performance of
GPT-4o, the speech model of cascaded ASR, LLM, and TTS outperforms the jointly
trained model after text-speech alignment in speech2speech protocols; (2)
considering paralinguistic information, the knowledgeability of the speech
model mainly depends on the LLM backbone, and its multilingual support is
limited by the speech module; (3) excellent speech models can already
understand the paralinguistic information in speech input, but generating
appropriate audio with paralinguistic information is still a challenge.
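Arena-style pairwise comparisons like these are commonly aggregated into model rankings with an Elo-style update. The abstract does not state how S2S-Arena aggregates its manual judgments, so the scheme below is only an assumed, standard formulation.

```python
def elo_update(r_a, r_b, winner, k=32):
    """One pairwise arena comparison between models A and B.
    winner=0 means A won, winner=1 means B won. Returns updated ratings."""
    e_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))  # expected score of A
    s_a = 1.0 if winner == 0 else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1 - s_a) - (1 - e_a))
```

Because the updates are zero-sum, repeated comparisons converge to a relative ranking of the evaluated speech models.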
☆ Capacity-Aware Inference: Mitigating the Straggler Effect in Mixture of Experts
The Mixture of Experts (MoE) is an effective architecture for scaling large
language models by leveraging sparse expert activation, optimizing the
trade-off between performance and efficiency. However, under expert
parallelism, MoE suffers from inference inefficiencies due to imbalanced
token-to-expert assignment, where some experts are overloaded while others
remain underutilized. This imbalance leads to poor resource utilization and
increased latency, as the most burdened expert dictates the overall delay, a
phenomenon we define as the \textbf{\textit{Straggler Effect}}. To mitigate
this, we propose Capacity-Aware Inference, including two key techniques: (1)
\textbf{\textit{Capacity-Aware Token Drop}}, which discards overloaded tokens
to regulate the maximum latency of MoE, and (2) \textbf{\textit{Capacity-Aware
Token Reroute}}, which reallocates overflowed tokens to underutilized experts,
balancing the token distribution. These techniques collectively optimize both
high-load and low-load expert utilization, leading to a more efficient MoE
inference pipeline. Extensive experiments demonstrate the effectiveness of our
methods, showing significant improvements in inference efficiency, e.g., 0.2\%
average performance increase and a 1.94$\times$ inference speedup on
Mixtral-8$\times$7B-Instruct.
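The two techniques can be illustrated with a greedy toy router: each token carries a ranked list of preferred experts, overflow tokens are rerouted to their next preference, and tokens with no remaining slot are dropped. This is a simplified sketch, not the paper's exact algorithm.

```python
def capacity_aware_reroute(assignments, capacity, n_experts):
    """assignments: per-token ranked expert preferences.
    Tokens beyond an expert's capacity go to their next-preferred expert
    with free slots (Token Reroute); tokens with no slot anywhere are
    dropped (Token Drop), recorded as None."""
    load = [0] * n_experts
    routed = []
    for prefs in assignments:
        placed = None
        for e in prefs:
            if load[e] < capacity:
                load[e] += 1
                placed = e
                break
        routed.append(placed)
    return routed, load
```

Capping per-expert load this way bounds the latency of the slowest expert, which is exactly the straggler the paper targets.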
☆ The study of short texts in digital politics: Document aggregation for topic modeling
Statistical topic modeling is widely used in political science to study text.
Researchers examine documents of varying lengths, from tweets to speeches.
There is ongoing debate on how document length affects the interpretability of
topic models. We investigate the effects of aggregating short documents into
larger ones based on natural units that partition the corpus. In our study, we
analyze one million tweets by U.S. state legislators from April 2016 to
September 2020. We find that for documents aggregated at the account level,
topics are more associated with individual states than when using individual
tweets. This finding is replicated with Wikipedia pages aggregated by birth
cities, showing how document definitions can impact topic modeling results.
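The aggregation step itself is simple: concatenate all short documents sharing a natural unit (here, the posting account) into one pseudo-document before topic modeling. A minimal sketch, with hypothetical field names:

```python
from collections import defaultdict

def aggregate_documents(docs, key):
    """Concatenate short documents (e.g. tweets) into one pseudo-document
    per natural unit (e.g. the posting account)."""
    groups = defaultdict(list)
    for doc in docs:
        groups[key(doc)].append(doc["text"])
    return {unit: " ".join(texts) for unit, texts in groups.items()}
```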
☆ No Free Labels: Limitations of LLM-as-a-Judge Without Human Grounding
LLM-as-a-Judge is a framework that uses an LLM (large language model) to
evaluate the quality of natural language text - typically text that is also
generated by an LLM. This framework holds great promise due to its relative
low-cost, ease of use, and strong correlations with human stylistic
preferences. However, LLM Judges have been shown to exhibit biases that can
distort their judgments. We evaluate how well LLM Judges can grade whether a
given response to a conversational question is correct, an ability crucial to
soundly estimating the overall response quality. To do so, we create and
publicly release a human-annotated dataset with labels of correctness for 1,200
LLM responses. We source questions from a combination of existing datasets and
a novel, challenging benchmark (BFF-Bench) created for this analysis. We
demonstrate a strong connection between an LLM's ability to correctly answer a
question and grade responses to that question. Although aggregate level
statistics might imply a judge has high agreement with human annotators, it
will struggle on the subset of questions it could not answer. To address this
issue, we recommend a simple solution: provide the judge with a correct,
human-written reference answer. We perform an in-depth analysis on how
reference quality can affect the performance of an LLM Judge. We show that
a weaker judge (e.g., Qwen 2.5 7B) provided with higher-quality references
reaches better agreement with human annotators than a stronger judge (e.g.,
GPT-4o) with synthetic references.
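The recommended fix amounts to adding a reference answer to the grading prompt. The prompt wording below is invented for illustration and is not the paper's actual template.

```python
def build_judge_prompt(question, response, reference=None):
    """Correctness-grading prompt for an LLM judge; supplying a
    human-written reference answer is the simple fix recommended above."""
    prompt = f"Question: {question}\nResponse: {response}\n"
    if reference is not None:
        prompt += f"Reference answer: {reference}\n"
    return prompt + "Is the response correct? Answer yes or no."
```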
☆ ModernBERT is More Efficient than Conventional BERT for Chest CT Findings Classification in Japanese Radiology Reports
Objective: This study aims to evaluate and compare the performance of two
Japanese language models-conventional Bidirectional Encoder Representations
from Transformers (BERT) and the newer ModernBERT-in classifying findings from
chest CT reports, with a focus on tokenization efficiency, processing time, and
classification performance. Methods: We conducted a retrospective study using
the CT-RATE-JPN dataset containing 22,778 training reports and 150 test
reports. Both models were fine-tuned for multi-label classification of 18
common chest CT conditions. The training data was split into 18,222 reports
for training and 4,556 for validation. Performance was evaluated using F1
scores for each
condition and exact match accuracy across all 18 labels. Results: ModernBERT
demonstrated superior tokenization efficiency, requiring 24.0% fewer tokens per
document (258.1 vs. 339.6) compared to BERT Base. This translated to
significant performance improvements, with ModernBERT completing training in
1877.67 seconds versus BERT's 3090.54 seconds (39% reduction). ModernBERT
processed 38.82 samples per second during training (1.65x faster) and 139.90
samples per second during inference (1.66x faster). Despite these efficiency
gains, classification performance remained comparable, with ModernBERT
achieving superior F1 scores in 8 conditions, while BERT performed better in 4
conditions. Overall exact match accuracy was slightly higher for ModernBERT
(74.67% vs. 72.67%), though this difference was not statistically significant
(p=0.6291). Conclusion: ModernBERT offers substantial improvements in
tokenization efficiency and training speed without sacrificing classification
performance. These results suggest that ModernBERT is a promising candidate for
clinical applications in Japanese radiology reports analysis.
comment: 23 pages, 8 figures
♻ ☆ Shifting Long-Context LLMs Research from Input to Output
Recent advancements in long-context Large Language Models (LLMs) have
primarily concentrated on processing extended input contexts, resulting in
significant strides in long-context comprehension. However, the equally
critical aspect of generating long-form outputs has received comparatively less
attention. This paper advocates for a paradigm shift in NLP research toward
addressing the challenges of long-output generation. Tasks such as novel
writing, long-term planning, and complex reasoning require models to understand
extensive contexts and produce coherent, contextually rich, and logically
consistent extended text. These demands highlight a critical gap in current LLM
capabilities. We underscore the importance of this under-explored domain and
call for focused efforts to develop foundational LLMs tailored for generating
high-quality, long-form outputs, which hold immense potential for real-world
applications.
comment: Preprint
♻ ☆ DIMSUM: Discourse in Mathematical Reasoning as a Supervision Module
We look at reasoning on GSM8k, a dataset of short texts presenting
primary-school math problems. We find, with Mirzadeh et al. (2024), that
current LLM progress on the dataset may not be explained by better reasoning
but by
exposure to a broader pretraining data distribution. We then introduce a novel
information source for helping models with less data or inferior training
reason better: discourse structure. We show that discourse structure improves
performance for models like Llama2 13b by up to 160%. Even for models that have
most likely memorized the dataset, adding discourse structural information to
the model still improves predictions and dramatically improves large-model
performance on out-of-distribution examples.
♻ ☆ START: Self-taught Reasoner with Tools
Chengpeng Li, Mingfeng Xue, Zhenru Zhang, Jiaxi Yang, Beichen Zhang, Xiang Wang, Bowen Yu, Binyuan Hui, Junyang Lin, Dayiheng Liu
Large reasoning models (LRMs) like OpenAI-o1 and DeepSeek-R1 have
demonstrated remarkable capabilities in complex reasoning tasks through the
utilization of long Chain-of-thought (CoT). However, these models often suffer
from hallucinations and inefficiencies due to their reliance solely on internal
reasoning processes. In this paper, we introduce START (Self-Taught Reasoner
with Tools), a novel tool-integrated long CoT reasoning LLM that significantly
enhances reasoning capabilities by leveraging external tools. Through code
execution, START is capable of performing complex computations, self-checking,
exploring diverse methods, and self-debugging, thereby addressing the
limitations of LRMs. The core innovation of START lies in its self-learning
framework, which comprises two key techniques: 1) Hint-infer: We demonstrate
that inserting artificially designed hints (e.g., ``Wait, maybe using Python
here is a good idea.'') during the inference process of an LRM effectively
stimulates its ability to utilize external tools without the need for any
demonstration data. Hint-infer can also serve as a simple and effective
sequential test-time scaling method; 2) Hint Rejection Sampling Fine-Tuning
(Hint-RFT): Hint-RFT combines Hint-infer and RFT by scoring, filtering, and
modifying the reasoning trajectories with tool invocation generated by an LRM
via Hint-infer, followed by fine-tuning the LRM. Through this framework, we
have fine-tuned the QwQ-32B model to achieve START. On PhD-level science QA
(GPQA), competition-level math benchmarks (AMC23, AIME24, AIME25), and the
competition-level code benchmark (LiveCodeBench), START achieves accuracy rates
of 63.6%, 95.0%, 66.7%, 47.1%, and 47.3%, respectively. It significantly
outperforms the base QwQ-32B and achieves performance comparable to the
state-of-the-art open-weight model R1-Distill-Qwen-32B and the proprietary
model o1-Preview.
comment: 38 pages, 5 figures and 6 tables
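Hint-infer reduces to splicing an artificial hint string into the decode stream at chosen insertion points. The toy loop below illustrates the mechanism; the pause-marker trigger and function names are invented, and the real method operates on an LRM's token stream rather than a callback.

```python
def hint_infer(generate_step, prompt, hint, max_tokens=50, stop="</answer>"):
    """Toy decode loop: whenever the model emits a pause marker, splice in
    an artificial hint (e.g. "Wait, maybe using Python here is a good
    idea.") nudging it toward tool use. generate_step(text) -> next token."""
    text = prompt
    for _ in range(max_tokens):
        token = generate_step(text)
        if token == "<pause>":  # illustrative insertion trigger
            token = hint
        text += token
        if stop in text:
            break
    return text
```

Inserting more hints before stopping also yields the sequential test-time scaling behavior mentioned in the abstract.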
♻ ☆ LINGOLY-TOO: Disentangling Memorisation from Reasoning with Linguistic Templatisation and Orthographic Obfuscation
Jude Khouja, Karolina Korgul, Simi Hellsten, Lingyi Yang, Vlad Neacsu, Harry Mayne, Ryan Kearns, Andrew Bean, Adam Mahdi
Assessing the reasoning capabilities of large language models (LLMs) is
susceptible to overestimation due to data exposure of evaluation benchmarks. We
introduce a framework for producing linguistic reasoning problems that reduces
the effect of memorisation in model performance estimates and apply this
framework to develop LINGOLY-TOO, a challenging benchmark for linguistic
reasoning. By developing orthographic templates, we dynamically obfuscate the
writing systems of real languages to generate numerous question variations.
These variations preserve the reasoning steps required for each solution while
reducing the likelihood of specific problem instances appearing in model
training data. Our experiments demonstrate that frontier models, including
Claude 3.7 Sonnet, o1-preview and DeepSeek R1, struggle with advanced reasoning.
Our analysis also shows that LLMs exhibit noticeable variance in accuracy
across permutations of the same problem, and on average perform better on
questions appearing in their original orthography. Our findings highlight the
opaque nature of response generation in LLMs and provide evidence that prior
data exposure contributes to overestimating the reasoning capabilities of
frontier models.
♻ ☆ Adding Alignment Control to Language Models
Post-training alignment has increasingly become a crucial factor in enhancing
the usability of language models (LMs). However, the strength of alignment
varies depending on individual preferences. This paper proposes a method to
incorporate alignment control into a single model, referred to as CLM. This
approach adds one identity layer preceding the initial layers and performs
preference learning only on this layer to map unaligned input token embeddings
into the aligned space. Experimental results demonstrate that this efficient
fine-tuning method performs comparably to full fine-tuning. During inference,
the input embeddings are processed through both the aligned and unaligned
layers and merged via an interpolation coefficient. By controlling this
coefficient, the alignment strength exhibits clear interpolation and
extrapolation behavior.
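The merge step is a convex combination of the two embedding paths, with values of the coefficient outside [0, 1] giving the extrapolation regime. A minimal element-wise sketch:

```python
def merge_embeddings(unaligned, aligned, alpha):
    """Interpolate (alpha in [0, 1]) or extrapolate (alpha outside it)
    between unaligned and aligned token embeddings, element-wise."""
    return [(1 - alpha) * u + alpha * a for u, a in zip(unaligned, aligned)]
```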
♻ ☆ Ticktack : Long Span Temporal Alignment of Large Language Models Leveraging Sexagenary Cycle Time Expression
Xue Han, Qian Hu, Yitong Wang, Wenchun Gao, Lianlian Zhang, Qing Wang, Lijun Mei, Chao Deng, Junlan Feng
Large language models (LLMs) suffer from temporal misalignment issues,
especially across long spans of time. The issue arises because LLMs are
trained on large amounts of data in which temporal information is rather
sparse over long periods, such as thousands of years, resulting in
insufficient learning or catastrophic forgetting. This paper proposes a
methodology named "Ticktack" for addressing the LLM's long-time span
misalignment in a yearly setting. Specifically, we first propose to utilize the
sexagenary year expression instead of the Gregorian year expression employed by
LLMs, achieving a more uniform distribution in yearly granularity. Then, we
employ polar coordinates to model the sexagenary cycle of 60 terms and the year
order within each term, with additional temporal encoding to ensure LLMs
understand them. Finally, we present a temporal representational alignment
approach for post-training LLMs that effectively distinguishes time points with
relevant knowledge, hence improving performance on time-related tasks,
particularly over a long period. We also create a long time span benchmark for
evaluation. Experimental results prove the effectiveness of our proposal.
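The year mapping can be sketched as follows: a Gregorian year maps to a term in the 60-term sexagenary cycle plus a cycle index, and these become polar-coordinate features (angle for the term, radius for the cycle order). The paper's exact encoding is not given in the abstract, so the anchoring at 1984 (a jiazi cycle start) and the radius choice are illustrative assumptions.

```python
import math

def sexagenary_features(year):
    """Map a Gregorian year to (term, cycle) in the sexagenary system and
    then to polar-coordinate features: angle encodes the term within the
    60-term cycle, radius encodes the cycle (year order across cycles)."""
    offset = year - 1984              # 1984 began a sexagenary cycle (jiazi)
    term = offset % 60
    cycle = offset // 60
    theta = 2 * math.pi * term / 60
    radius = 1.0 + cycle              # illustrative positive radius
    return term, cycle, (radius * math.cos(theta), radius * math.sin(theta))
```

Two years sharing a term (e.g. 1984 and 2044) get the same angle but different radii, so the representation distinguishes them while keeping yearly granularity uniform.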
♻ ☆ Chart-HQA: A Benchmark for Hypothetical Question Answering in Charts
Xiangnan Chen, Yuancheng Fang, Qian Xiao, Juncheng Li, Jun Lin, Siliang Tang, Yi Yang, Yueting Zhuang
Multimodal Large Language Models (MLLMs) have garnered significant attention
for their strong visual-semantic understanding. Most existing chart benchmarks
evaluate MLLMs' ability to parse information from charts to answer questions.
However, they overlook the inherent output biases of MLLMs, where models rely
on their parametric memory to answer questions rather than genuinely
understanding the chart content. To address this limitation, we introduce a
novel Chart Hypothetical Question Answering (HQA) task, which imposes
assumptions on the same question to compel models to engage in counterfactual
reasoning based on the chart content. Furthermore, we introduce HAI, a human-AI
interactive data synthesis approach that leverages the efficient text-editing
capabilities of LLMs alongside human expert knowledge to generate diverse and
high-quality HQA data at a low cost. Using HAI, we construct Chart-HQA, a
challenging benchmark synthesized from publicly available data sources.
Evaluation results on 18 MLLMs of varying model sizes reveal that current
models face significant generalization challenges and exhibit imbalanced
reasoning performance on the HQA task.
comment: Under review
♻ ☆ Simple linear attention language models balance the recall-throughput tradeoff
Simran Arora, Sabri Eyuboglu, Michael Zhang, Aman Timalsina, Silas Alberti, Dylan Zinsley, James Zou, Atri Rudra, Christopher Ré
Recent work has shown that attention-based language models excel at recall,
the ability to ground generations in tokens previously seen in context.
However, the efficiency of attention-based models is bottle-necked during
inference by the KV-cache's aggressive memory consumption. In this work, we
explore whether we can improve language model efficiency (e.g. by reducing
memory consumption) without compromising on recall. By applying experiments and
theory to a broad set of architectures, we identify a key tradeoff between a
model's state size and recall ability. We show that efficient alternatives to
attention (e.g. H3, Mamba, RWKV) maintain a fixed-size recurrent state, but
struggle at recall. We propose BASED, a simple architecture combining linear
and sliding-window attention. By varying the BASED window size and linear
attention feature dimension, we can dial the state size and traverse the
Pareto frontier
of the recall-memory tradeoff curve, recovering the full quality of attention
on one end and the small state size of attention-alternatives on the other. We
train language models up to 1.3b parameters and show that BASED matches the
strongest sub-quadratic models (e.g. Mamba) in perplexity and outperforms them
on real-world recall-intensive tasks by 6.22 accuracy points. Implementations
of linear attention are often less efficient than optimized standard attention
implementations. To make BASED competitive, we develop IO-aware algorithms that
enable 24x higher throughput on language generation than FlashAttention-2, when
generating 1024 tokens using 1.3b parameter models. Code for this work is
provided at: https://github.com/HazyResearch/based.
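The fixed-size-state property of the linear-attention half can be seen in a few lines: causal linear attention keeps only a d x d_v state matrix and a normalizer, independent of sequence length. This sketch shows that component alone (BASED pairs it with sliding-window attention), with an elu+1 feature map as one common choice.

```python
import numpy as np

def linear_attention(q, k, v):
    """Causal linear attention with feature map phi(x) = elu(x) + 1.
    State S (d x d_v) and normalizer z are constant-size in sequence
    length -- the small-state end of the recall-memory tradeoff."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1
    S = np.zeros((q.shape[1], v.shape[1]))
    z = np.zeros(q.shape[1])
    out = np.zeros_like(v)
    for t in range(q.shape[0]):
        pk = phi(k[t])
        S += np.outer(pk, v[t])      # accumulate key-value outer products
        z += pk
        pq = phi(q[t])
        out[t] = (pq @ S) / (pq @ z + 1e-9)
    return out
```

With uniform keys the output degenerates to a running average of the values, which illustrates why a pure fixed-size state struggles at precise recall.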
♻ ☆ Entangled Relations: Leveraging NLI and Meta-analysis to Enhance Biomedical Relation Extraction
Recent research efforts have explored the potential of leveraging natural
language inference (NLI) techniques to enhance relation extraction (RE). In
this vein, we introduce MetaEntailRE, a novel adaptation method that harnesses
NLI principles to enhance RE performance. Our approach follows past works by
verbalizing relation classes into class-indicative hypotheses, aligning a
traditionally multi-class classification task to one of textual entailment. We
introduce three key enhancements: (1) Meta-class analysis which, instead of
labeling non-entailed premise-hypothesis pairs with the less informative
"neutral" entailment label, provides additional context by analyzing
overarching meta-relationships between classes; (2) Feasible hypothesis
filtering, which removes unlikely hypotheses from consideration based on domain
knowledge derived from data; and (3) Group-based prediction selection, which
further improves performance by selecting highly confident predictions.
MetaEntailRE is conceptually simple and empirically powerful, yielding
significant improvements over conventional relation extraction techniques and
other NLI formulations. We observe surprisingly large F1 gains of 17.6 points
on BioRED and 13.4 points on ReTACRED compared to conventional methods,
underscoring the versatility of MetaEntailRE across both biomedical and general
domains.
comment: 17 pages, 1 figure
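The verbalization step maps each relation class to a class-indicative hypothesis to be scored against the premise by an NLI model. The templates below are toy examples, not the paper's actual verbalizations.

```python
TEMPLATES = {
    "treats":      "{head} is a treatment for {tail}.",
    "causes":      "{head} is a cause of {tail}.",
    "no_relation": "{head} and {tail} are unrelated.",
}

def verbalize(head, tail):
    """Cast multi-class relation extraction as textual entailment: one
    class-indicative hypothesis per relation class."""
    return {rel: t.format(head=head, tail=tail) for rel, t in TEMPLATES.items()}
```

Feasible-hypothesis filtering would simply prune entries of this dictionary before scoring, based on the entity types involved.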
♻ ☆ DeltaProduct: Increasing the Expressivity of DeltaNet Through Products of Householders ICLR 2025
Linear Recurrent Neural Networks (linear RNNs) have emerged as competitive
alternatives to Transformers for sequence modeling, offering efficient training
and linear-time inference. However, existing architectures face a fundamental
trade-off between expressivity and efficiency, dictated by the structure of
their state-transition matrices. While diagonal matrices used in architectures
like Mamba, GLA, or mLSTM yield fast runtime, they suffer from severely limited
expressivity. To address this, recent architectures such as (Gated) DeltaNet
and RWKVv7 adopted a diagonal plus rank-1 structure, allowing simultaneous
token-channel mixing, which overcomes some expressivity limitations with only a
slight decrease in training efficiency. Building on the interpretation of
DeltaNet's recurrence as performing one step of online gradient descent per
token on an associative recall loss, we introduce DeltaProduct, which instead
takes multiple ($n_h$) steps per token. This naturally leads to diagonal plus
rank-$n_h$ state-transition matrices, formed as products of $n_h$ generalized
Householder transformations, providing a tunable mechanism to balance
expressivity and efficiency and a stable recurrence. Through extensive
experiments, we demonstrate that DeltaProduct achieves superior state-tracking
and language modeling capabilities while exhibiting significantly improved
length extrapolation compared to DeltaNet. Additionally, we also strengthen the
theoretical foundation of DeltaNet's expressivity by proving that it can solve
dihedral group word problems in just two layers.
comment: Accepted at ICLR 2025 Workshop on Foundation Models in the Wild
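The rank-$n_h$ structure follows directly from multiplying $n_h$ generalized Householder factors $(I - \beta_i v_i v_i^\top)$: the product differs from the identity by a matrix of rank at most $n_h$. A minimal numeric sketch (without DeltaNet's gating or learned parameters):

```python
import numpy as np

def householder_product(vs, betas):
    """State-transition matrix as a product of n_h generalized Householder
    transforms: H = prod_i (I - beta_i * v_i v_i^T). The result equals the
    identity plus a correction of rank at most n_h."""
    d = len(vs[0])
    H = np.eye(d)
    for v, beta in zip(vs, betas):
        v = np.asarray(v, dtype=float)
        H = H @ (np.eye(d) - beta * np.outer(v, v))
    return H
```

With beta = 2 and unit vectors, each factor is an exact reflection, recovering the classical Householder case.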
♻ ☆ DeFT: Decoding with Flash Tree-attention for Efficient Tree-structured LLM Inference ICLR'25
Large language models (LLMs) are increasingly employed for complex tasks that
process multiple generation calls in a tree structure with shared prefixes of
tokens, including few-shot prompting, multi-step reasoning, speculative
decoding, etc. However, existing inference systems for tree-based applications
are inefficient due to improper partitioning of queries and KV cache during
attention calculation. This leads to two main issues: (1) a lack of memory
access (IO) reuse for KV cache of shared prefixes, and (2) poor load
balancing. As a result, there is redundant KV cache IO between GPU global memory
and shared memory, along with low GPU utilization. To address these challenges,
we propose DeFT (Decoding with Flash Tree-Attention), a hardware-efficient
attention algorithm with prefix-aware and load-balanced KV cache partitions.
DeFT reduces the number of read/write operations of KV cache during attention
calculation through KV-Guided Grouping, a method that avoids repeatedly loading
KV cache of shared prefixes in attention computation. Additionally, we propose
Flattened Tree KV Splitting, a mechanism that ensures even distribution of the
KV cache across partitions with little computation redundancy, enhancing GPU
utilization during attention computations. By reducing 73-99% KV cache IO and
nearly 100% IO for partial results during attention calculation, DeFT achieves
up to 2.23/3.59x speedup in the end-to-end/attention latency across three
practical tree-based workloads compared to state-of-the-art attention
algorithms. Our code is available at https://github.com/LINs-lab/DeFT.
comment: Update DeFT-v4, accepted by ICLR'25
(https://openreview.net/forum?id=2c7pfOqu9k). Our code is available at
https://github.com/LINs-lab/DeFT
♻ ☆ Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes
Simran Arora, Brandon Yang, Sabri Eyuboglu, Avanika Narayan, Andrew Hojel, Immanuel Trummer, Christopher Ré
A long-standing goal of the data management community is to develop general,
automated systems that ingest semi-structured documents and output queryable
tables without human effort or domain-specific customization. Given the sheer
variety of potential documents, state-of-the-art systems make simplifying
assumptions and use domain-specific training. In this work, we ask whether we
can maintain generality by using large language models (LLMs). LLMs, which are
pretrained on broad data, can perform diverse downstream tasks simply
conditioned on natural language task descriptions.
We propose and evaluate EVAPORATE, a simple, prototype system powered by
LLMs. We identify two fundamentally different strategies for implementing this
system: prompt the LLM to directly extract values from documents or prompt the
LLM to synthesize code that performs the extraction. Our evaluations show a
cost-quality tradeoff between these two approaches. Code synthesis is cheap,
but far less accurate than directly processing each document with the LLM. To
improve quality while maintaining low cost, we propose an extended code
synthesis implementation, EVAPORATE-CODE+, which achieves better quality than
direct extraction. Our key insight is to generate many candidate functions and
ensemble their extractions using weak supervision. EVAPORATE-CODE+ not only
outperforms the state-of-the-art systems, but does so using a sublinear pass
over the documents with the LLM. This equates to a 110x reduction in the number
of tokens the LLM needs to process, averaged across 16 real-world evaluation
settings of 10k documents each.
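The candidate-and-ensemble idea can be sketched with toy regex extractors standing in for LLM-synthesized functions, and a simple majority vote standing in for EVAPORATE-CODE+'s weak-supervision aggregation. Field name and patterns are invented for illustration.

```python
import re
from collections import Counter

# Hypothetical candidate extractors for a "price" attribute.
def extract_price_v1(doc):
    m = re.search(r"\$(\d+)", doc)
    return m.group(1) if m else None

def extract_price_v2(doc):
    m = re.search(r"price[:=]\s*\$?(\d+)", doc, re.I)
    return m.group(1) if m else None

def extract_price_v3(doc):
    m = re.search(r"(\d+) dollars", doc)
    return m.group(1) if m else None

def ensemble_extract(doc, fns):
    """Run every synthesized candidate function on a document and keep the
    majority answer -- a vote over noisy extractors in place of the
    paper's weak-supervision combination."""
    votes = Counter(fn(doc) for fn in fns if fn(doc) is not None)
    return votes.most_common(1)[0][0] if votes else None
```

Because the functions, not the LLM, process each document, the LLM cost is sublinear in the corpus size, matching the abstract's key claim.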
♻ ☆ Correcting Annotator Bias in Training Data: Population-Aligned Instance Replication (PAIR)
Models trained on crowdsourced labels may not reflect broader population
views, because those who work as annotators do not represent the population. We
propose Population-Aligned Instance Replication (PAIR), a method to address
bias caused by non-representative annotator pools. Using a simulation study of
offensive language and hate speech, we create two types of annotators with
different labeling tendencies and generate datasets with varying proportions of
the types. We observe that models trained on unbalanced annotator pools show
poor calibration compared to those trained on representative data. By
duplicating labels from underrepresented annotator groups to match population
proportions, PAIR reduces bias without collecting additional annotations. These
results suggest that statistical techniques from survey research can improve
model performance. We conclude with practical recommendations for improving the
representativity of training data and model performance.
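The replication step can be sketched directly: compute how far each annotator group's share falls below its population share and duplicate that group's labels accordingly. Integer replication factors are a simplification of the method.

```python
def pair_replicate(labels, groups, target_props):
    """Duplicate labels from underrepresented annotator groups so the pool
    approximates target population proportions (simplified to integer
    replication factors)."""
    counts = {g: groups.count(g) for g in set(groups)}
    n = len(groups)
    out = []
    for label, g in zip(labels, groups):
        factor = max(1, round(target_props[g] * n / counts[g]))
        out.extend([label] * factor)
    return out
```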
♻ ☆ SynSUM -- Synthetic Benchmark with Structured and Unstructured Medical Records AAAI 2025
We present the SynSUM benchmark, a synthetic dataset linking unstructured
clinical notes to structured background variables. The dataset consists of
10,000 artificial patient records containing tabular variables (like symptoms,
diagnoses and underlying conditions) and related notes describing the fictional
patient encounter in the domain of respiratory diseases. The tabular portion of
the data is generated through a Bayesian network, where both the causal
structure between the variables and the conditional probabilities are proposed
by an expert based on domain knowledge. We then prompt a large language model
(GPT-4o) to generate a clinical note related to this patient encounter,
describing the patient symptoms and additional context. We conduct both an
expert evaluation study to assess the quality of the generated notes, as well
as running some simple predictor models on both the tabular and text portions
of the dataset, forming a baseline for further research. The SynSUM dataset is
primarily designed to facilitate research on clinical information extraction in
the presence of tabular background variables, which can be linked through
domain knowledge to concepts of interest to be extracted from the text - the
symptoms, in the case of SynSUM. Secondary uses include research on the
automation of clinical reasoning over both tabular data and text, causal effect
estimation in the presence of tabular and/or textual confounders, and
multi-modal synthetic data generation.
comment: The dataset can be downloaded from https://github.com/prabaey/synsum.
Presented at the GenAI4Health workshop at AAAI 2025
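The tabular generation can be illustrated with a tiny ancestral-sampling routine over a Bayesian network. The structure and probabilities below are invented for illustration; SynSUM's expert-specified network is far richer.

```python
import random

def sample_patient(rng):
    """Tiny illustrative Bayesian net in the SynSUM spirit: an underlying
    condition influences a diagnosis, which influences a symptom.
    All probabilities here are made up."""
    asthma = rng.random() < 0.1
    pneumonia = rng.random() < (0.3 if asthma else 0.1)
    cough = rng.random() < (0.9 if pneumonia else 0.2)
    return {"asthma": asthma, "pneumonia": pneumonia, "cough": cough}
```

Each sampled record would then be handed to the LLM prompt that writes the corresponding clinical note.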
♻ ☆ AlphaEdit: Null-Space Constrained Knowledge Editing for Language Models
Large language models (LLMs) often exhibit hallucinations due to incorrect or
outdated knowledge. Hence, model editing methods have emerged to enable
targeted knowledge updates. To achieve this, a prevailing paradigm is the
locating-then-editing approach, which first locates influential parameters and
then edits them by introducing a perturbation. While effective, current studies
have demonstrated that this perturbation inevitably disrupts the originally
preserved knowledge within LLMs, especially in sequential editing scenarios. To
address this, we introduce AlphaEdit, a novel solution that projects
perturbation onto the null space of the preserved knowledge before applying it
to the parameters. We theoretically prove that this projection ensures the
output of post-edited LLMs remains unchanged when queried about the preserved
knowledge, thereby mitigating the issue of disruption. Extensive experiments on
various LLMs, including LLaMA3, GPT2-XL, and GPT-J, show that AlphaEdit boosts
the performance of most locating-then-editing methods by an average of 36.4%,
requiring only a single additional line of code for the projection. Our code is
available at: https://github.com/jianghoucheng/AlphaEdit.
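The core operation is a standard null-space projection: with preserved-knowledge keys stacked as columns of K, the projector P = I - K (K^T K)^+ K^T satisfies P K = 0, so a perturbation multiplied by P cannot change outputs on those keys. This is a simplified numeric sketch of that algebra, not the paper's full editing pipeline.

```python
import numpy as np

def null_space_projection(K):
    """Projector onto the null space of preserved-knowledge keys K
    (columns = key vectors): P = I - K (K^T K)^+ K^T."""
    return np.eye(K.shape[0]) - K @ np.linalg.pinv(K.T @ K) @ K.T
```

For any weight update delta, (delta @ P) @ K is zero, so responses to the preserved keys are left untouched while the edit is applied in the remaining directions.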
♻ ☆ The interplay between domain specialization and model size
Roseval Malaquias Junior, Ramon Pires, Thales Sales Almeida, Kenzo Sakiyama, Roseli A. F. Romero, Rodrigo Nogueira
Scaling laws for language models have often focused on finding the optimal
model size and token count for training from scratch. However, achieving this
optimal balance requires significant compute resources due to the extensive
data demands when training models from randomly-initialized weights. Continued
pretraining offers a cost-effective alternative, leveraging the compute
investment from pretrained models to incorporate new knowledge without
requiring extensive new data. Recent findings suggest that data quality
influences constants in scaling laws, thereby altering the optimal
parameter-token allocation ratio. Building on this insight, we investigate the
interplay between domain specialization and model size during continued
pretraining under compute-constrained scenarios. Our goal is to identify an
optimal training regime for this scenario and detect patterns in this interplay
that can be generalized across different model sizes and domains. To compare
general and specialized training, we filtered a web-based dataset to extract
data from three domains: legal, medical, and accounting. We pretrained models
with 1.5B, 3B, 7B, and 14B parameters on both the unfiltered and filtered
datasets, then evaluated their performance on domain-specific exams. Results
show that as model size increases, specialized models outperform general models
while requiring less training compute. Additionally, their growing compute
efficiency leads to reduced forgetting of previously learned knowledge.
♻ ☆ SoK: Membership Inference Attacks on LLMs are Rushing Nowhere (and How to Fix It)
Whether LLMs memorize their training data and what this means, from measuring
privacy leakage to detecting copyright violations, has become a rapidly growing
area of research. In the last few months, more than 10 new methods have been
proposed to perform Membership Inference Attacks (MIAs) against LLMs. Contrary
to traditional MIAs, which rely on fixed (but randomized) records or models, these
methods are mostly trained and tested on datasets collected post-hoc. Sets of
members and non-members, used to evaluate the MIA, are constructed using
informed guesses after the release of a model. This lack of randomization
raises concerns of a distribution shift between members and non-members. In
this work, we first extensively review the literature on MIAs against LLMs and
show that, while most work focuses on sequence-level MIAs evaluated in post-hoc
setups, a range of target models, motivations and units of interest are
considered. We then quantify distribution shifts present in 6 datasets used in
the literature using a model-less bag-of-words classifier and show that all
datasets constructed post-hoc suffer from strong distribution shifts. These
shifts invalidate the claims of LLMs memorizing strongly in real-world
scenarios and, potentially, also the methodological contributions of the recent
papers based on these datasets. Yet, all hope might not be lost. We introduce
important considerations to properly evaluate MIAs against LLMs and discuss, in
turn, potential ways forward: randomized test splits, injections of randomized
(unique) sequences, randomized fine-tuning, and several post-hoc control
methods. While each option comes with its advantages and limitations, we
believe they collectively provide solid grounds to guide MIA development and
study LLM memorization. We conclude with an overview of recommended approaches
to benchmark sequence-level and document-level MIAs against LLMs.
comment: IEEE Conference on Secure and Trustworthy Machine Learning (SaTML
2025)
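The shift-detection idea is that if a classifier using only surface word counts (no target model) separates members from non-members, the split itself is confounded. A minimal overlap-scoring sketch, assuming one simple way such a bag-of-words check could be built:

```python
from collections import Counter

def bow_vector(text):
    return Counter(text.lower().split())

def train_bow_classifier(members, non_members):
    """Model-less distribution-shift check: score a text by which class's
    aggregate word distribution it overlaps more. Above-chance accuracy on
    held-out splits signals a member/non-member distribution shift."""
    m_counts = sum((bow_vector(t) for t in members), Counter())
    n_counts = sum((bow_vector(t) for t in non_members), Counter())
    def classify(text):
        v = bow_vector(text)
        m_score = sum(min(c, m_counts[w]) for w, c in v.items())
        n_score = sum(min(c, n_counts[w]) for w, c in v.items())
        return "member" if m_score >= n_score else "non-member"
    return classify
```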
♻ ☆ LiGT: Layout-infused Generative Transformer for Visual Question Answering on Vietnamese Receipts
Document Visual Question Answering (Document VQA) challenges multimodal
systems to holistically handle textual, layout, and visual modalities to
provide appropriate answers. Document VQA has gained popularity in recent years
due to the increasing amount of documents and the high demand for digitization.
Nonetheless, most document VQA datasets are developed in high-resource
languages such as English. In this paper, we present ReceiptVQA
(\textbf{Receipt} \textbf{V}isual \textbf{Q}uestion \textbf{A}nswering), the
first large-scale Vietnamese document VQA dataset dedicated to receipts, a
document type with high commercial potential. The dataset encompasses
\textbf{9,000+} receipt images and \textbf{60,000+} manually annotated
question-answer pairs. In addition to our study, we introduce LiGT
(\textbf{L}ayout-\textbf{i}nfused \textbf{G}enerative \textbf{T}ransformer), a
layout-aware encoder-decoder architecture designed to leverage embedding layers
of language models to operate on layout embeddings, minimizing the use of
additional neural modules. Experiments on ReceiptVQA show that our architecture
yielded promising performance, achieving competitive results compared with
outstanding baselines. Furthermore, in analyzing the experimental results, we
found clear evidence that encoder-only architectures are at a considerable
disadvantage compared with architectures that can generate
answers. We also observed that it is necessary to combine multiple modalities
to tackle our dataset, despite the critical role of semantic understanding from
language models. We hope that our work will encourage and facilitate future
development in Vietnamese document VQA, contributing to a diverse multimodal
research community in the Vietnamese language.
comment: Accepted at IJDAR
♻ ☆ Bootstrapping Language Models with DPO Implicit Rewards ICLR 2025
Changyu Chen, Zichen Liu, Chao Du, Tianyu Pang, Qian Liu, Arunesh Sinha, Pradeep Varakantham, Min Lin
Human alignment in large language models (LLMs) is an active area of
research. A recent groundbreaking work, direct preference optimization (DPO),
has greatly simplified the process from past work in reinforcement learning
from human feedback (RLHF) by bypassing the reward learning stage in RLHF. DPO,
after training, provides an implicit reward model. In this work, we make a
novel observation that this implicit reward model can by itself be used in a
bootstrapping fashion to further align the LLM. Our approach is to use the
rewards from a current LLM to construct a preference dataset, which is then
used in subsequent DPO rounds. We incorporate two refinements to further
improve our approach: 1) length-regularized reward shaping to make the
preference dataset length-unbiased; 2) experience replay to enhance the quality
of the preference dataset. Our approach, named self-alignment with DPO ImpliCit
rEwards (DICE), shows great improvements in alignment. It achieves an increase
of more than 8% in length-controlled win rate on AlpacaEval 2 for all the
different base models that we tried, without relying on external feedback. Our
code is available at https://github.com/sail-sg/dice.
comment: Accepted in ICLR 2025
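The bootstrapping loop described by DICE can be sketched briefly: DPO's implicit reward, beta * log(pi_theta(y|x) / pi_ref(y|x)), scores sampled responses, a length penalty debiases them, and the top- and bottom-ranked responses form the next round's preference pair. The penalty form, helper names, and all numbers below are illustrative assumptions, not the paper's exact recipe.

```python
def implicit_reward(logp_policy: float, logp_ref: float, beta: float = 0.1) -> float:
    """DPO's implicit reward: beta * log(pi_theta(y|x) / pi_ref(y|x)),
    computed from summed token log-probabilities of a response."""
    return beta * (logp_policy - logp_ref)

def length_regularized_reward(logp_policy, logp_ref, n_tokens,
                              alpha=0.001, beta=0.1):
    # Hypothetical linear length penalty to keep the preference
    # dataset length-unbiased (assumed form, not the paper's).
    return implicit_reward(logp_policy, logp_ref, beta) - alpha * n_tokens

def build_preference_pair(prompt, responses):
    """Rank sampled responses by implicit reward; best and worst
    become the (chosen, rejected) pair for the next DPO round."""
    scored = sorted(
        responses,
        key=lambda r: length_regularized_reward(
            r["logp_policy"], r["logp_ref"], r["n_tokens"]),
        reverse=True,
    )
    return {"prompt": prompt,
            "chosen": scored[0]["text"],
            "rejected": scored[-1]["text"]}

responses = [
    {"text": "short good answer", "logp_policy": -10.0, "logp_ref": -20.0, "n_tokens": 5},
    {"text": "long padded answer", "logp_policy": -15.0, "logp_ref": -16.0, "n_tokens": 400},
]
pair = build_preference_pair("What is DPO?", responses)
```

With these toy log-probabilities the short response scores 0.995 versus -0.3 for the padded one, so it is selected as "chosen".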
♻ ☆ NLI under the Microscope: What Atomic Hypothesis Decomposition Reveals NAACL 2025
Decomposition of text into atomic propositions is a flexible framework
allowing for the closer inspection of input and output text. We use atomic
decomposition of hypotheses in two natural language reasoning tasks,
traditional NLI and defeasible NLI, to form atomic sub-problems, or granular
inferences that models must weigh when solving the overall problem. These
atomic sub-problems serve as a tool to further understand the structure of both
NLI and defeasible reasoning, probe a model's consistency and understanding of
different inferences, and measure the diversity of examples in benchmark
datasets. Our results indicate that LLMs still struggle with logical
consistency on atomic NLI and defeasible NLI sub-problems. Lastly, we identify
critical atomic sub-problems of defeasible NLI examples, or those that most
contribute to the overall label, and propose a method to measure the
inferential consistency of a model, a metric designed to capture the degree to
which a model makes consistently correct or incorrect predictions about the
same fact under different contexts.
comment: Accepted to NAACL 2025
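The inferential-consistency metric above can be illustrated with a small sketch. This is one plausible instantiation, not necessarily the paper's exact formula: for each atomic fact, measure how often the model's correctness matches its own majority behaviour across contexts, then average over facts.

```python
from collections import defaultdict

def inferential_consistency(records):
    """records: list of (atomic_fact, prediction_was_correct) pairs,
    with the same fact appearing under different contexts.
    Assumed formula: per-fact majority-agreement rate, averaged."""
    by_fact = defaultdict(list)
    for fact, correct in records:
        by_fact[fact].append(correct)
    per_fact = []
    for outcomes in by_fact.values():
        majority = max(outcomes.count(True), outcomes.count(False))
        per_fact.append(majority / len(outcomes))
    return sum(per_fact) / len(per_fact)

records = [
    ("birds fly", True), ("birds fly", True), ("birds fly", False),
    ("ice is hot", False), ("ice is hot", False),
]
score = inferential_consistency(records)  # (2/3 + 1) / 2
```

A model that is consistently right (or consistently wrong) about each fact scores 1.0; flip-flopping across contexts pulls the score toward 0.5.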
♻ ☆ CNsum: Automatic Summarization for Chinese News Text
Obtaining valuable information from massive data efficiently has become our
research goal in the era of Big Data. Text summarization technology has been
continuously developed to meet this demand. Recent work has also shown that
transformer-based pre-trained language models have achieved great success on
various tasks in Natural Language Processing (NLP). Aiming at the problem of
Chinese news text summary generation and the application of Transformer
structure on Chinese, this paper proposes a Chinese news text summarization
model (CNsum) based on Transformer structure, and tests it on Chinese datasets
such as THUCNews. Experimental results show that CNsum achieves better ROUGE
scores than the baseline models, verifying its superior performance.
comment: This withdrawal is due to the lack of authorization from all
co-authors for the publication of this version
♻ ☆ MeanCache: User-Centric Semantic Caching for LLM Web Services
Large Language Models (LLMs) like ChatGPT and Llama have revolutionized
natural language processing and search engine dynamics. However, these models
incur exceptionally high computational costs. For instance, GPT-3 consists of
175 billion parameters, where inference demands billions of floating-point
operations. Caching is a natural solution to reduce LLM inference costs on
repeated queries, which constitute about 31% of the total queries. However,
existing caching methods can neither find semantic similarities among
LLM queries nor operate on contextual queries, leading to unacceptable
false hit-and-miss rates. This paper introduces MeanCache, a user-centric
semantic cache for LLM-based services that identifies semantically similar
queries to determine cache hit or miss. Using MeanCache, the response to a
user's semantically similar query can be retrieved from a local cache rather
than re-querying the LLM, thus reducing costs, service provider load, and
environmental impact. MeanCache leverages Federated Learning (FL) to
collaboratively train a query similarity model without violating user privacy.
By placing a local cache in each user's device and using FL, MeanCache reduces
the latency and costs and enhances model performance, resulting in lower false
hit rates. MeanCache also encodes context chains for every cached query,
offering a simple yet highly effective mechanism to discern contextual query
responses from standalone ones. Our experiments, benchmarked against the
state-of-the-art caching method, reveal that MeanCache attains an approximately
17% higher F-score and a 20% increase in precision during semantic cache
hit-and-miss decisions while performing even better on contextual queries. It
also reduces the storage requirement by 83% and accelerates semantic cache
hit-and-miss decisions by 11%.
comment: Accepted at 2025 IEEE 39th International Parallel and Distributed
Processing Symposium (IPDPS)
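The core cache-hit decision in a semantic cache like MeanCache can be sketched as an embedding-similarity lookup against locally cached queries. The toy embeddings and threshold below are illustrative; the actual system trains its similarity model via federated learning rather than using raw cosine similarity.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

class SemanticCache:
    """Minimal local semantic cache: a query embedding is a hit if its
    best cosine similarity to any cached query clears the threshold."""
    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.entries = []  # list of (embedding, cached_response)

    def lookup(self, emb):
        best, best_sim = None, -1.0
        for cached_emb, response in self.entries:
            sim = cosine(emb, cached_emb)
            if sim > best_sim:
                best, best_sim = response, sim
        return best if best_sim >= self.threshold else None

    def insert(self, emb, response):
        self.entries.append((emb, response))

cache = SemanticCache(threshold=0.9)
cache.insert([1.0, 0.0, 0.2], "Paris")
hit = cache.lookup([0.98, 0.02, 0.21])  # near-duplicate query -> hit
miss = cache.lookup([0.0, 1.0, 0.0])    # unrelated query -> miss
```

On a hit the cached response is returned locally instead of re-querying the LLM, which is where the cost and latency savings come from.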
♻ ☆ AILS-NTUA at SemEval-2025 Task 8: Language-to-Code prompting and Error Fixing for Tabular Question Answering
In this paper, we present our submission to SemEval-2025 Task 8: Question
Answering over Tabular Data. This task, evaluated on the DataBench dataset,
assesses Large Language Models' (LLMs) ability to answer natural language
questions over structured data while addressing topic diversity and table size
limitations in previous benchmarks. We propose a system that employs effective
LLM prompting to translate natural language queries into executable code,
enabling accurate responses, error correction, and interpretability. Our
approach ranks first in both subtasks of the competition in the proprietary
model category, significantly outperforming the organizer's baseline.
♻ ☆ LIFT: Improving Long Context Understanding of Large Language Models through Long Input Fine-Tuning
Long context understanding remains challenging for large language models due
to their limited context windows. This paper presents Long Input Fine-Tuning
(LIFT), a novel framework for long-context modeling that can improve the
long-context performance of arbitrary (short-context) LLMs by dynamically
adapting model parameters based on the long input. Importantly, LIFT, rather
than endlessly extending the context window size to accommodate increasingly
longer inputs in context, chooses to store and absorb the long input in its
parameters. By fine-tuning the long input into model parameters, LIFT allows
short-context LLMs to answer questions even when the required information is
not provided in the context during inference. Furthermore, to enhance LIFT
performance while maintaining the original in-context learning (ICL)
capabilities, we introduce Gated Memory, a specialized attention adapter that
automatically balances long input memorization and ICL. We provide a
comprehensive analysis of the strengths and limitations of LIFT on long context
understanding, offering valuable directions for future research.
comment: arXiv admin note: text overlap with arXiv:2412.13626
♻ ☆ ECCOS: Efficient Capability and Cost Coordinated Scheduling for Multi-LLM Serving
As large language models (LLMs) are increasingly deployed as service
endpoints in systems, the surge in query volume creates significant scheduling
challenges. Existing scheduling frameworks mainly target latency
optimization while neglecting the capability of LLMs to serve different levels
of queries, which can lead to wasted computational resources. This paper
addresses this challenge by proposing a capability-cost coordinated scheduling
framework, ECCOS, for multi-LLM serving, which explicitly constrains response
quality and workload to optimize LLM inference cost. Specifically, it
introduces a two-stage scheduling scheme by designing a multi-objective predictor
and a constrained optimizer. The predictor estimates both model capabilities
and computational costs through training-based and retrieval-based approaches,
while the optimizer determines cost-optimal assignments under quality and
workload constraints. It also introduces QAServe, a dataset collected for
sample-wise response quality and costs by zero-shot prompting different LLMs on
knowledge QA and mathematical reasoning. Extensive experiments demonstrate that
ECCOS improves success rates by 6.30% while reducing costs by 10.15% compared
to existing methods, consuming less than 0.5% of LLM response time. The code is
available at: https://github.com/agiresearch/ECCOS.
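A heavily simplified view of capability-cost coordinated scheduling: given per-model capability predictions, route each query to the cheapest model whose predicted quality clears a constraint. The capability functions, floor value, and fallback rule below are assumptions for illustration, not ECCOS's actual predictor or optimizer.

```python
def schedule(query_difficulty, models, quality_floor=0.8):
    """Pick the cheapest model whose predicted success probability on
    this query clears the quality constraint; if none does, fall back
    to the most capable model (assumed fallback rule)."""
    feasible = [m for m in models
                if m["capability"](query_difficulty) >= quality_floor]
    if not feasible:
        return max(models,
                   key=lambda m: m["capability"](query_difficulty))["name"]
    return min(feasible, key=lambda m: m["cost"])["name"]

models = [
    {"name": "small-llm", "cost": 1,  "capability": lambda d: 1.0 - d},
    {"name": "large-llm", "cost": 10, "capability": lambda d: 1.0 - 0.2 * d},
]
easy = schedule(0.1, models)  # the cheap model already clears the floor
hard = schedule(0.9, models)  # only the expensive model clears it
```

Easy queries stay on the cheap model and only hard ones pay for the large one, which is the intuition behind the reported cost savings.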
♻ ☆ Emergent Language: A Survey and Taxonomy
Jannik Peters, Constantin Waubert de Puiseau, Hasan Tercan, Arya Gopikrishnan, Gustavo Adolpho Lucas De Carvalho, Christian Bitter, Tobias Meisen
The field of emergent language represents a novel area of research within the
domain of artificial intelligence, particularly within the context of
multi-agent reinforcement learning. Although the concept of studying language
emergence is not new, early approaches were primarily concerned with explaining
human language formation, with little consideration given to its potential
utility for artificial agents. In contrast, studies based on reinforcement
learning aim to develop communicative capabilities in agents that are
comparable to or even superior to human language. Thus, they extend beyond the
learned statistical representations that are common in natural language
processing research. This gives rise to a number of fundamental questions, from
the prerequisites for language emergence to the criteria for measuring its
success. This paper addresses these questions by providing a comprehensive
review of 181 scientific publications on emergent language in artificial
intelligence. Its objective is to serve as a reference for researchers
interested in or proficient in the field. Consequently, the main contributions
are the definition and overview of the prevailing terminology, the analysis of
existing evaluation methods and metrics, and the description of the identified
research gaps.
comment: published in Journal of Autonomous Agents and Multi-Agent Systems
♻ ☆ Familiarity: Better Evaluation of Zero-Shot Named Entity Recognition by Quantifying Label Shifts in Synthetic Training Data
Zero-shot named entity recognition (NER) is the task of detecting named
entities of specific types (such as 'Person' or 'Medicine') without any
training examples. Current research increasingly relies on large synthetic
datasets, automatically generated to cover tens of thousands of distinct entity
types, to train zero-shot NER models. However, in this paper, we find that
these synthetic datasets often contain entity types that are semantically
highly similar to (or even the same as) those in standard evaluation
benchmarks. Because of this overlap, we argue that reported F1 scores for
zero-shot NER overestimate the true capabilities of these approaches. Further,
we argue that current evaluation setups provide an incomplete picture of
zero-shot abilities since they do not quantify the label shift (i.e., the
similarity of labels) between training and evaluation datasets. To address
these issues, we propose Familiarity, a novel metric that captures both the
semantic similarity between entity types in training and evaluation, as well as
their frequency in the training data, to provide an estimate of label shift. It
allows researchers to contextualize reported zero-shot NER scores when using
custom synthetic training datasets. Further, it enables researchers to generate
evaluation setups of various transfer difficulties for fine-grained analysis of
zero-shot NER.
comment: 9 pages, 4 figures, 5 tables
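One assumed instantiation of such a label-shift metric: for each evaluation entity type, take the expected semantic similarity to the training label distribution, weighting each training type by its frequency, and average over evaluation types. The similarity stand-in and numbers below are purely illustrative; the paper's Familiarity metric may be defined differently.

```python
def familiarity(eval_types, train_counts, sim):
    """Frequency-weighted expected similarity between each evaluation
    label and the training label distribution, averaged over labels.
    sim(a, b) is assumed to return a similarity in [0, 1]."""
    total = sum(train_counts.values())
    per_type = []
    for et in eval_types:
        per_type.append(sum(sim(et, tt) * cnt / total
                            for tt, cnt in train_counts.items()))
    return sum(per_type) / len(per_type)

def toy_sim(a, b):
    # Stand-in for an embedding-based semantic similarity.
    if a == b:
        return 1.0
    return 0.5 if a.split()[0] == b.split()[0] else 0.0

train = {"person": 80, "drug name": 20}
score = familiarity(["person", "drug dose"], train, toy_sim)
```

A score near 1 signals heavy train/eval label overlap (so zero-shot numbers are inflated); a score near 0 signals a genuinely hard transfer.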
♻ ☆ A Confidence-based Acquisition Model for Self-supervised Active Learning and Label Correction
Carel van Niekerk, Christian Geishauser, Michael Heck, Shutong Feng, Hsien-chin Lin, Nurul Lubis, Benjamin Ruppik, Renato Vukovic, Milica Gašić
Supervised neural approaches are hindered by their dependence on large,
meticulously annotated datasets, a requirement that is particularly cumbersome
for sequential tasks. The quality of annotations tends to deteriorate with the
transition from expert-based to crowd-sourced labelling. To address these
challenges, we present CAMEL (Confidence-based Acquisition Model for Efficient
self-supervised active Learning), a pool-based active learning framework
tailored to sequential multi-output problems. CAMEL possesses two core
features: (1) it requires expert annotators to label only a fraction of a
chosen sequence, and (2) it facilitates self-supervision for the remainder of
the sequence. By deploying a label correction mechanism, CAMEL can also be
utilised for data cleaning. We evaluate CAMEL on two sequential tasks, with a
special emphasis on dialogue belief tracking, a task plagued by the constraints
of limited and noisy datasets. Our experiments demonstrate that CAMEL
significantly outperforms the baselines in terms of efficiency. Furthermore,
the data corrections suggested by our method contribute to an overall
improvement in the quality of the resulting datasets.
♻ ☆ EdgeMoE: Empowering Sparse Large Language Models on Mobile Devices
Large language models (LLMs) such as GPTs and Mixtral-8x7B have
revolutionized machine intelligence due to their exceptional abilities in
generic ML tasks. Transitioning LLMs from datacenters to edge devices brings
benefits like better privacy and availability, but is challenged by their
massive parameter size and thus unbearable runtime costs. To this end, we
present EdgeMoE, an on-device inference engine for mixture-of-expert (MoE) LLMs
-- a popular form of sparse LLM that scales its parameter size with almost
constant computing complexity. EdgeMoE achieves both memory- and
compute-efficiency by partitioning the model into the storage hierarchy:
non-expert weights are held in device memory; while expert weights are held on
external storage and fetched to memory only when activated. This design is
motivated by a key observation that expert weights are bulky but infrequently
used due to sparse activation. To further reduce the expert I/O swapping
overhead, EdgeMoE incorporates two novel techniques: (1) expert-wise bitwidth
adaptation that reduces the expert sizes with tolerable accuracy loss; (2)
expert preloading that predicts the activated experts ahead of time and
preloads them via the compute-I/O pipeline. On popular MoE LLMs and edge
devices, EdgeMoE showcases significant memory savings and speedups over
competitive baselines. The code is available at
https://github.com/UbiquitousLearning/mllm.
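The storage-hierarchy idea behind EdgeMoE can be sketched as an expert store that keeps only a few recently used experts in memory and pages the rest in from "external storage" on activation. The LRU policy and dictionary-backed storage below are toy assumptions standing in for the real engine's bitwidth adaptation and predictive preloading.

```python
from collections import OrderedDict

class ExpertStore:
    """Toy expert paging: expert weights live on 'disk' (a dict here)
    and are fetched into a small LRU-managed memory cache on demand."""
    def __init__(self, disk_experts, cache_slots=2):
        self.disk = disk_experts
        self.cache = OrderedDict()
        self.cache_slots = cache_slots
        self.io_loads = 0  # counts simulated storage fetches

    def get(self, expert_id):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)  # cache hit: mark as recent
            return self.cache[expert_id]
        weights = self.disk[expert_id]         # miss: simulated I/O fetch
        self.io_loads += 1
        self.cache[expert_id] = weights
        if len(self.cache) > self.cache_slots:
            self.cache.popitem(last=False)     # evict least recently used
        return weights

store = ExpertStore({i: f"w{i}" for i in range(8)}, cache_slots=2)
for eid in [0, 1, 0, 2, 0]:  # sparse activation pattern reuses expert 0
    store.get(eid)
```

Because activations are sparse and skewed, repeated experts hit the small cache and only three of the five lookups touch storage in this run.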
♻ ☆ Dialogue Ontology Relation Extraction via Constrained Chain-of-Thought Decoding SIGDIAL 2024
Renato Vukovic, David Arps, Carel van Niekerk, Benjamin Matthias Ruppik, Hsien-Chin Lin, Michael Heck, Milica Gašić
State-of-the-art task-oriented dialogue systems typically rely on
task-specific ontologies for fulfilling user queries. The majority of
task-oriented dialogue data, such as customer service recordings, comes without
ontology and annotation. Such ontologies are normally built manually, limiting
the application of specialised systems. Dialogue ontology construction is an
approach for automating that process and typically consists of two steps: term
extraction and relation extraction. In this work, we focus on relation
extraction in a transfer learning set-up. To improve the generalisation, we
propose an extension to the decoding mechanism of large language models. We
adapt Chain-of-Thought (CoT) decoding, recently developed for reasoning
problems, to generative relation extraction. Here, we generate multiple
branches in the decoding space and select the relations based on a confidence
threshold. By constraining the decoding to ontology terms and relations, we aim
to decrease the risk of hallucination. We conduct extensive experimentation on
two widely used datasets and find performance improvements on the target
ontology for both source fine-tuned and one-shot prompted large language models.
comment: Accepted to appear at SIGDIAL 2024. 9 pages, 4 figures
♻ ☆ LongEval: A Comprehensive Analysis of Long-Text Generation Through a Plan-based Paradigm
Siwei Wu, Yizhi Li, Xingwei Qu, Rishi Ravikumar, Yucheng Li, Tyler Loakman, Shanghaoran Quan, Xiaoyong Wei, Riza Batista-Navarro, Chenghua Lin
Large Language Models (LLMs) have achieved remarkable success in various
natural language processing tasks, yet their ability to generate long-form
content remains poorly understood and evaluated. Our analysis reveals that
current LLMs struggle with length requirements and information density in
long-text generation, with performance deteriorating as text length increases.
To quantitatively locate this performance degradation and provide further
insights on model development, we present LongEval, a benchmark that evaluates
long-text generation through both direct and plan-based generation paradigms,
inspired by cognitive and linguistic writing models. The comprehensive
experiments in this work reveal interesting findings, for example that while
model size correlates with generation ability, small-scale models well-trained
on long texts (e.g., LongWriter) achieve comparable performance. All code
and datasets are released in https://github.com/Wusiwei0410/LongEval.
comment: Under review
♻ ☆ Are AI Detectors Good Enough? A Survey on Quality of Datasets With Machine-Generated Texts AAAI 2025
The rapid development of autoregressive Large Language Models (LLMs) has
significantly improved the quality of generated texts, necessitating reliable
machine-generated text detectors. A huge number of detectors and collections
with AI fragments have emerged, and several detection methods even showed
recognition quality up to 99.9% according to the target metrics in such
collections. However, the quality of such detectors tends to drop dramatically
in the wild, posing a question: Are detectors actually highly trustworthy or do
their high benchmark scores come from the poor quality of evaluation datasets?
In this paper, we emphasise the need for robust, high-quality methods for
evaluating generated data that guard against bias and the poor generalisation
of future models. We present a systematic review of datasets from
competitions dedicated to AI-generated content detection and propose methods
for evaluating the quality of datasets containing AI-generated fragments. In
addition, we discuss the possibility of using high-quality generated data to
achieve two goals: improving the training of detection models and improving the
training datasets themselves. Our contribution aims to facilitate a better
understanding of the dynamics between human and machine text, which will
ultimately support the integrity of information in an increasingly automated
world. The code is available at
https://github.com/Advacheck-OU/ai-dataset-analysing.
comment: Presented at Preventing and Detecting LLM Misinformation (PDLM) at
AAAI 2025
♻ ☆ RoToR: Towards More Reliable Responses for Order-Invariant Inputs
Mitigating positional bias of language models (LMs) for listwise inputs is a
well-known and important problem (e.g., lost-in-the-middle). While zero-shot
order-invariant LMs have been proposed to solve this issue, their success on
practical listwise problems has been limited. In this work, as a first
contribution, we identify and overcome two limitations to make zero-shot
invariant LMs more practical: (1) training and inference distribution mismatch
arising from modifying positional ID assignments to enforce invariance, and (2)
failure to adapt to a mixture of order-invariant and sensitive inputs in
practical listwise problems. Then, to overcome these issues we propose (1)
RoToR, a zero-shot invariant LM for genuinely order-invariant inputs with
minimal modifications of positional IDs, and (2) Selective Routing, an adaptive
framework that handles both order-invariant and order-sensitive inputs in
listwise tasks. On the Lost in the middle (LitM), Knowledge Graph QA (KGQA),
and MMLU benchmarks, we show that RoToR with Selective Routing can effectively
handle practical listwise input tasks in a zero-shot manner.
♻ ☆ DetectRL: Benchmarking LLM-Generated Text Detection in Real-World Scenarios NeurIPS 2024
Detecting text generated by large language models (LLMs) is of great recent
interest. With zero-shot methods like DetectGPT, detection capabilities have
reached impressive levels. However, the reliability of existing detectors in
real-world applications remains underexplored. In this study, we present a new
benchmark, DetectRL, highlighting that even state-of-the-art (SOTA) detection
techniques still underperform on this task. We collected human-written
datasets from domains where LLMs are particularly prone to misuse. Using
popular LLMs, we generated data that better aligns with real-world
applications. Unlike previous studies, we employed heuristic rules to create
adversarial LLM-generated text, simulating various prompt usages, human
revisions like word substitutions, and writing noises like spelling mistakes.
Our development of DetectRL reveals the strengths and limitations of current
SOTA detectors. More importantly, we analyzed the potential impact of writing
styles, model types, attack methods, text lengths, and real-world human
writing factors on different types of detectors. We believe DetectRL could
serve as an effective benchmark for assessing detectors in real-world
scenarios, evolving with advanced attack methods, thus providing a more rigorous
evaluation to drive the development of more efficient detectors. Data and code
are publicly available at: https://github.com/NLP2CT/DetectRL.
comment: Accepted to NeurIPS 2024 Datasets and Benchmarks Track (Camera-Ready)
♻ ☆ Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
Microsoft, :, Abdelrahman Abouelenin, Atabak Ashfaq, Adam Atkinson, Hany Awadalla, Nguyen Bach, Jianmin Bao, Alon Benhaim, Martin Cai, Vishrav Chaudhary, Congcong Chen, Dong Chen, Dongdong Chen, Junkun Chen, Weizhu Chen, Yen-Chun Chen, Yi-ling Chen, Qi Dai, Xiyang Dai, Ruchao Fan, Mei Gao, Min Gao, Amit Garg, Abhishek Goswami, Junheng Hao, Amr Hendy, Yuxuan Hu, Xin Jin, Mahmoud Khademi, Dongwoo Kim, Young Jin Kim, Gina Lee, Jinyu Li, Yunsheng Li, Chen Liang, Xihui Lin, Zeqi Lin, Mengchen Liu, Yang Liu, Gilsinia Lopez, Chong Luo, Piyush Madan, Vadim Mazalov, Arindam Mitra, Ali Mousavi, Anh Nguyen, Jing Pan, Daniel Perez-Becker, Jacob Platin, Thomas Portet, Kai Qiu, Bo Ren, Liliang Ren, Sambuddha Roy, Ning Shang, Yelong Shen, Saksham Singhal, Subhojit Som, Xia Song, Tetyana Sych, Praneetha Vaddamanu, Shuohang Wang, Yiming Wang, Zhenghao Wang, Haibin Wu, Haoran Xu, Weijian Xu, Yifan Yang, Ziyi Yang, Donghan Yu, Ishmam Zabir, Jianwen Zhang, Li Lyna Zhang, Yunan Zhang, Xiren Zhou
We introduce Phi-4-Mini and Phi-4-Multimodal, compact yet highly capable
language and multimodal models. Phi-4-Mini is a 3.8-billion-parameter language
model trained on high-quality web and synthetic data, significantly
outperforming recent open-source models of similar size and matching the
performance of models twice its size on math and coding tasks requiring complex
reasoning. This achievement is driven by a carefully curated synthetic data
recipe emphasizing high-quality math and coding datasets. Compared to its
predecessor, Phi-3.5-Mini, Phi-4-Mini features an expanded vocabulary size of
200K tokens to better support multilingual applications, as well as group query
attention for more efficient long-sequence generation. Phi-4-Multimodal is a
multimodal model that integrates text, vision, and speech/audio input
modalities into a single model. Its novel modality extension approach leverages
LoRA adapters and modality-specific routers to allow multiple inference modes
combining various modalities without interference. For example, it now ranks
first on the OpenASR leaderboard to date, even though the LoRA component of the
speech/audio modality has just 460 million parameters. Phi-4-Multimodal
supports scenarios involving (vision + language), (vision + speech), and
(speech/audio) inputs, outperforming larger vision-language and speech-language
models on a wide range of tasks. Additionally, we experiment to further train
Phi-4-Mini to enhance its reasoning capabilities. Despite its compact
3.8-billion-parameter size, this experimental version achieves reasoning
performance on par with or surpassing significantly larger models, including
DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-Llama-8B.
comment: 39 pages
♻ ☆ AdEval: Alignment-based Dynamic Evaluation to Mitigate Data Contamination in Large Language Models
As Large Language Models (LLMs) are pretrained on massive-scale corpora, the
issue of data contamination has become increasingly severe, leading to
potential overestimation of model performance during evaluation. To address
this, we propose AdEval (Alignment-based Dynamic Evaluation), a dynamic data
evaluation method aimed at mitigating the impact of data contamination on
evaluation reliability. Experimental results on multiple datasets demonstrate
that AdEval effectively reduces the impact of data contamination on evaluation
outcomes, enhancing both the fairness and reliability of the evaluation
process.
comment: There are serious academic problems in this paper, such as data
falsification and plagiarism in the method of the paper
♻ ☆ Transformers for molecular property prediction: Domain adaptation efficiently improves performance
Most of the current transformer-based chemical language models are
pre-trained on millions to billions of molecules. However, the improvement from
such scaling in dataset size is not confidently linked to improved molecular
property prediction. The aim of this study is to investigate and overcome some
of the limitations of transformer models in predicting molecular properties.
Specifically, we examine the impact of pre-training dataset size and diversity
on the performance of transformer models and investigate the use of domain
adaptation as a technique for improving model performance. First, our findings
indicate that increasing pretraining dataset size beyond 400K molecules from
the GuacaMol dataset does not result in a significant improvement on four ADME
endpoints, namely, solubility, permeability, microsomal stability, and plasma
protein binding. Second, our results demonstrate that using domain adaptation
by further training the transformer model on a small set of domain-relevant
molecules, i.e., a few hundred to a few thousand, using multi-task regression
of physicochemical properties was sufficient to significantly improve
performance for three out of the four investigated ADME endpoints (P-value <
0.001). Finally, we observe that a model pre-trained on 400K molecules and
domain-adapted on a few hundred/thousand molecules performs similarly (P-value
> 0.05) to more complex transformer models like MolBERT (pre-trained on 1.3M
molecules) and MolFormer (pre-trained on 100M molecules). A comparison to a
random forest model trained on basic physicochemical properties showed similar
performance to the examined transformer models. We believe that current
transformer models can be improved through further systematic analysis of
pre-training and downstream data, pre-training objectives, and scaling laws,
ultimately leading to better and more helpful models.
♻ ☆ ARIES: Stimulating Self-Refinement of Large Language Models by Iterative Preference Optimization
Yongcheng Zeng, Xinyu Cui, Xuanfa Jin, Guoqing Liu, Zexu Sun, Quan He, Dong Li, Ning Yang, Jianye Hao, Haifeng Zhang, Jun Wang
A truly intelligent Large Language Model (LLM) should be capable of
correcting errors in its responses through external interactions. However, even
the most advanced models often face challenges in improving their outputs. In
this paper, we explore how to cultivate LLMs with the self-refinement
capability through iterative preference training, and how this ability can be
leveraged to improve model performance during inference. To this end, we
introduce a novel post-training and inference framework, called ARIES: Adaptive
Refinement and Iterative Enhancement Structure. This method iteratively
performs preference training and self-refinement-based data collection. During
training, ARIES strengthens the model's direct question-answering capability
while simultaneously unlocking its self-refinement potential. During inference,
ARIES harnesses this self-refinement capability to generate a series of
progressively refined responses, which are then filtered using either the
Reward Model Scoring or a simple yet effective Rule-Based Selection mechanism,
specifically tailored to our approach, to construct a dataset for the next
round of preference training. Experimental results demonstrate the remarkable
performance of ARIES. When applied to the Llama-3.1-8B model and under the
self-refinement setting, ARIES surpasses powerful models such as GPT-4o,
achieving a 62.3% length-controlled (LC) win rate and a 63.3% raw win rate on
AlpacaEval 2, outperforming Iterative DPO by 27.8% and 35.5% respectively, as
well as a
50.3% win rate on Arena-Hard, surpassing Iterative DPO by 26.6%. Furthermore,
ARIES consistently enhances performance on mathematical reasoning tasks like
GSM8K and MATH.
♻ ☆ CLIP meets DINO for Tuning Zero-Shot Classifier using Unlabeled Image Collections
Mohamed Fazli Imam, Rufael Fedaku Marew, Jameel Hassan, Mustansar Fiaz, Alham Fikri Aji, Hisham Cholakkal
In the era of foundation models, CLIP has emerged as a powerful tool for
aligning text & visual modalities into a common embedding space. However, the
alignment objective used to train CLIP often results in subpar visual features
for fine-grained tasks. In contrast, SSL-pretrained models like DINO excel at
extracting rich visual features due to their specialized training paradigm.
Yet, these SSL models require an additional supervised linear probing step,
which relies on fully labeled data that is often expensive and difficult to
obtain at scale. In this paper, we propose a label-free prompt-tuning method
that leverages the rich visual features of self-supervised learning models
(DINO) and the broad textual knowledge of large language models (LLMs) to
largely enhance CLIP-based image classification performance using unlabeled
images. Our approach unfolds in three key steps: (1) We generate robust textual
feature embeddings that more accurately represent object classes by leveraging
class-specific descriptions from LLMs, enabling more effective zero-shot
classification compared to CLIP's default name-specific prompts. (2) These
textual embeddings are then used to produce pseudo-labels to train an alignment
module that integrates the complementary strengths of LLM description-based
textual embeddings & DINO's visual features. (3) Finally, we prompt-tune CLIP's
vision encoder through DINO-assisted supervision using the trained alignment
module. This three-step process allows us to harness the best of visual &
textual foundation models, resulting in a powerful and efficient approach that
surpasses state-of-the-art label-free classification methods. Notably, our
framework, NoLA (No Labels Attached), achieves an average absolute gain of 3.6%
over the state-of-the-art LaFTer across 11 diverse image classification
datasets. Our code & models can be found at https://github.com/fazliimam/NoLA.
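Step (2) of the pipeline above, pseudo-labelling unlabeled images against class text embeddings, can be sketched as a nearest-text-embedding assignment (a toy illustration with made-up 2-D features, not NoLA's actual alignment module):

```python
# Illustrative sketch: assign each unlabeled image the class whose text
# embedding is closest in cosine similarity, producing pseudo-labels.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def pseudo_labels(image_feats, class_text_embs):
    """Return the index of the most similar class embedding per image."""
    labels = []
    for f in image_feats:
        sims = [cosine(f, t) for t in class_text_embs]
        labels.append(max(range(len(sims)), key=sims.__getitem__))
    return labels

text_embs = [[1.0, 0.0], [0.0, 1.0]]   # two classes (toy embeddings)
imgs = [[0.9, 0.1], [0.2, 0.8]]        # toy visual features
labels = pseudo_labels(imgs, text_embs)
# labels == [0, 1]
```

In the paper's setting, the image features would come from DINO and the class embeddings from LLM-generated descriptions, with the pseudo-labels supervising the alignment module.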
♻ ☆ LLM-based Discriminative Reasoning for Knowledge Graph Question Answering
Large language models (LLMs) based on generative pre-trained Transformer have
achieved remarkable performance on knowledge graph question-answering (KGQA)
tasks. However, LLMs often produce ungrounded subgraph planning or reasoning
results in KGQA due to the hallucinatory behavior brought by the generative
paradigm. To tackle this issue, we propose READS to reformulate the KGQA
process into discriminative subtasks, which simplifies the search space for
each subtask. Based on these subtasks, we design a new corresponding
discriminative inference strategy to conduct the reasoning for KGQA, thereby
alleviating hallucination and ungrounded reasoning issues in LLMs. Experimental
results show that the proposed approach outperforms multiple strong baselines
and achieves state-of-the-art performance on the widely used WebQSP and CWQ
benchmarks.
♻ ☆ Explicit vs. Implicit: Investigating Social Bias in Large Language Models through Self-Reflection
Large Language Models (LLMs) have been shown to exhibit various biases and
stereotypes in their generated content. While extensive research has
investigated bias in LLMs, prior work has predominantly focused on explicit
bias, leaving the more nuanced implicit biases largely unexplored. This paper
presents a systematic framework grounded in social psychology theories to
investigate and compare explicit and implicit biases in LLMs. We propose a
novel "self-reflection" based evaluation framework that operates in two phases:
first measuring implicit bias through simulated psychological assessment
methods, then evaluating explicit bias by prompting LLMs to analyze their own
generated content. Through extensive experiments on state-of-the-art LLMs
across multiple social dimensions, we demonstrate that LLMs exhibit a
substantial inconsistency between explicit and implicit biases, where explicit
biases manifest as mild stereotypes while implicit biases show strong
stereotypes. Furthermore, we investigate the underlying factors contributing to
this explicit-implicit bias inconsistency. Our experiments examine the effects
of training data scale, model parameters, and alignment techniques. Results
indicate that while explicit bias diminishes with increased training data and
model size, implicit bias exhibits a contrasting upward trend. Notably,
contemporary alignment methods (e.g., RLHF, DPO) effectively suppress explicit
bias but show limited efficacy in mitigating implicit bias. These findings
suggest that while scaling up models and alignment training can address
explicit bias, the challenge of implicit bias requires novel approaches beyond
current methodologies.
♻ ☆ NavRAG: Generating User Demand Instructions for Embodied Navigation through Retrieval-Augmented LLM
Vision-and-Language Navigation (VLN) is an essential skill for embodied
agents, allowing them to navigate in 3D environments following natural language
instructions. High-performance navigation models require large amounts of
training data, and the high cost of manual annotation has seriously hindered
this field. Therefore, some previous methods translate trajectory videos into
step-by-step instructions for expanding data, but such instructions do not
match well with users' communication styles that briefly describe destinations
or state specific needs. Moreover, local navigation trajectories overlook
global context and high-level task planning. To address these issues, we
propose NavRAG, a retrieval-augmented generation (RAG) framework that generates
user demand instructions for VLN. NavRAG leverages an LLM to build a
hierarchical scene description tree for 3D scene understanding from global
layout to local details, then simulates various user roles with specific
demands to retrieve from the scene tree, generating diverse instructions with
an LLM. We annotate over
2 million navigation instructions across 861 scenes and evaluate the data
quality and navigation performance of trained models.
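The retrieval step over the hierarchical scene tree can be pictured with a toy tree and a keyword lookup (the real system uses LLM-driven retrieval; the structure and names here are purely illustrative):

```python
# Hedged sketch: a toy scene description tree from global layout down to
# objects, and a keyword lookup standing in for LLM-based retrieval.

scene_tree = {
    "house": {
        "kitchen": ["fridge", "stove"],
        "bedroom": ["bed", "desk"],
    }
}

def retrieve(tree, demand):
    """Return the path of nodes relevant to a user demand keyword."""
    for region, rooms in tree.items():
        for room, objects in rooms.items():
            for obj in objects:
                if demand in obj:
                    return [region, room, obj]
    return []

path = retrieve(scene_tree, "bed")
# path == ["house", "bedroom", "bed"]
```

A simulated user role ("I want to rest") would map to such a demand, and the retrieved path provides the global-to-local context for instruction generation.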
♻ ☆ How Diversely Can Language Models Solve Problems? Exploring the Algorithmic Diversity of Model-Generated Code
Language models (LMs) have exhibited impressive abilities in generating code
from natural language requirements. In this work, we highlight the diversity of
code generated by LMs as a critical criterion for evaluating their code
generation capabilities. Studies assessing the diversity of generated code are
scarce, however, and its importance for code LMs has been overlooked.
Therefore, we propose a systematic approach to evaluate code diversity,
introducing various metrics with inter-code similarity. Specifically, we
introduce code clustering methods that leverage LMs' capabilities in code
understanding and reasoning, resulting in a set of metrics that represent the
number of algorithms in model-generated solutions. We extensively investigate
the property of model-generated solutions by contrasting them with
human-written ones and quantifying the impact of various factors on code
diversity: model size, temperature, instruction tuning, and problem complexity.
Our analysis demonstrates that model-generated solutions exhibit low
algorithmic diversity, a property that has so far been neglected by the
research community. Moreover,
we explore methods to increase code diversity by combining solutions from
different models and increasing sampling temperatures. Our findings highlight
that code diversity can be enhanced by combining heterogeneous models and by
sampling at temperatures above 1.0, a regime that has not been fully explored
due to the resulting degradation in functional correctness. To facilitate this
research direction, we
publicly share our code and datasets through open-source repositories.
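A cluster-counting diversity metric of the kind the abstract describes can be sketched with a simple similarity threshold (Jaccard over token sets here, not the paper's LM-based clustering; all names are illustrative):

```python
# Toy sketch: greedily cluster solutions by pairwise token-set similarity
# and report the cluster count as a proxy for the number of distinct
# algorithms among model-generated solutions.

def jaccard(a, b):
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def count_algorithms(solutions, threshold=0.5):
    clusters = []                      # each cluster keeps a representative
    for sol in solutions:
        for cluster in clusters:
            if jaccard(sol, cluster[0]) >= threshold:
                cluster.append(sol)
                break
        else:
            clusters.append([sol])     # no similar cluster: start a new one
    return len(clusters)

sols = ["for i in range n", "for j in range n", "while stack pop push"]
n = count_algorithms(sols)
# n == 2: the two loop variants cluster together, the stack variant does not
```

With an LM judging similarity instead of token overlap, the same cluster count would approximate how many genuinely different algorithms the model produced.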
♻ ☆ When Large Language Models Meet Evolutionary Algorithms: Potential Enhancements and Challenges
Pre-trained large language models (LLMs) exhibit powerful capabilities for
generating natural text. Evolutionary algorithms (EAs) can discover diverse
solutions to complex real-world problems. Motivated by the shared collective
dynamics and directionality of text generation and evolution, this paper first
illustrates the conceptual parallels between LLMs and EAs at a micro level,
which includes multiple one-to-one key characteristics: token representation
and individual representation, position encoding and fitness shaping, position
embedding and selection, Transformers block and reproduction, and model
training and parameter adaptation. These parallels highlight potential
opportunities for technical advancements in both LLMs and EAs. Subsequently, we
analyze existing interdisciplinary research from a macro perspective to uncover
critical challenges, with a particular focus on evolutionary fine-tuning and
LLM-enhanced EAs. These analyses not only provide insights into the
evolutionary mechanisms behind LLMs but also offer potential directions for
enhancing the capabilities of artificial agents.
comment: The article has been accepted for publication in Research
♻ ☆ Zero-resource Hallucination Detection for Text Generation via Graph-based Contextual Knowledge Triples Modeling AAAI25
Xinyue Fang, Zhen Huang, Zhiliang Tian, Minghui Fang, Ziyi Pan, Quntian Fang, Zhihua Wen, Hengyue Pan, Dongsheng Li
LLMs achieve remarkable performance but suffer from hallucinations. Most
research on hallucination detection focuses on questions with short, concrete
correct answers whose faithfulness is easy to check. Detecting hallucinations
in text generation with open-ended answers is more challenging.
Some researchers use external knowledge to detect hallucinations in generated
texts, but external resources for specific scenarios are hard to access. Recent
studies on detecting hallucinations in long text without external resources
conduct consistency comparison among multiple sampled outputs. To handle long
texts, researchers split long texts into multiple facts and individually
compare the consistency of each pair of facts. However, these methods (1)
hardly achieve alignment among multiple facts; (2) overlook dependencies
between multiple contextual facts. In this paper, we propose a graph-based
context-aware (GCA) hallucination detection method for text generation, which aligns
knowledge facts and considers the dependencies between contextual knowledge
triples in consistency comparison. Particularly, to align multiple facts, we
conduct a triple-oriented response segmentation to extract multiple knowledge
triples. To model dependencies among contextual knowledge triples (facts), we
organize the contextual triples into a graph and enhance their interactions via
message passing and aggregation with an RGCN. To avoid omitting knowledge
triples in long texts, we conduct an LLM-based reverse verification that
reconstructs the knowledge triples. Experiments show that our model enhances
hallucination detection and outperforms all baselines.
comment: Accepted by AAAI25
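The graph step above can be illustrated with one round of toy message passing over a triple graph (plain neighbor averaging, not the paper's relational GCN; the scores and edges are made up):

```python
# Simplified sketch: each node holds a consistency score for one knowledge
# triple; one round of message passing averages it with its neighbors,
# letting contextual triples influence each other's scores.

def message_pass(features, edges):
    """Update each node's score by averaging it with its neighbors'."""
    neigh = {i: [] for i in range(len(features))}
    for u, v in edges:
        neigh[u].append(v)
        neigh[v].append(u)
    updated = []
    for i, f in enumerate(features):
        vals = [f] + [features[j] for j in neigh[i]]
        updated.append(sum(vals) / len(vals))
    return updated

# Three triples; node 1 links the other two. Scores lie in [0, 1].
scores = [1.0, 0.0, 1.0]
updated = message_pass(scores, [(0, 1), (1, 2)])
# updated[1] rises toward its consistent neighbors: [0.5, 0.666..., 0.5]
```

An RGCN would additionally learn relation-specific transformations per edge type, but the propagation of consistency evidence between dependent triples follows this pattern.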
♻ ☆ Energy-Based Diffusion Language Models for Text Generation
Minkai Xu, Tomas Geffner, Karsten Kreis, Weili Nie, Yilun Xu, Jure Leskovec, Stefano Ermon, Arash Vahdat
Despite remarkable progress in autoregressive language models, alternative
generative paradigms beyond left-to-right generation are still being actively
explored. Discrete diffusion models, with the capacity for parallel generation,
have recently emerged as a promising alternative. Unfortunately, these models
still underperform their autoregressive counterparts, with the performance gap
increasing when reducing the number of sampling steps. Our analysis reveals
that this degradation is a consequence of an imperfect approximation used by
diffusion models. In this work, we propose Energy-based Diffusion Language
Model (EDLM), an energy-based model operating at the full sequence level for
each diffusion step, introduced to improve the underlying approximation used by
diffusion models. More specifically, we introduce an EBM in a residual form,
and show that its parameters can be obtained by leveraging a pretrained
autoregressive model or by finetuning a bidirectional transformer via noise
contrastive estimation. We also propose an efficient generation algorithm via
parallel importance sampling. Comprehensive experiments on language modeling
benchmarks show that our model can consistently outperform state-of-the-art
diffusion models by a significant margin, and approaches autoregressive models'
perplexity. We further show that, without any generation performance drop, our
framework offers a 1.3$\times$ sampling speedup over existing diffusion models.
Code to reproduce our results is available at
https://github.com/MinkaiXu/Energy-Diffusion-LLM.
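The importance-sampling idea can be sketched with a toy resampler that reweights base-sampler proposals by exp(-E(x)) (illustrative names and a made-up energy; this is not the paper's sequence-level EBM):

```python
# Hedged sketch: draw several proposals from a base sampler (standing in
# for the diffusion model) and importance-resample them with weights
# proportional to exp(-energy), so low-energy sequences are favored.
import math
import random

def energy_resample(proposals, energy, k=1, seed=0):
    """Importance-resample proposals according to exp(-energy)."""
    weights = [math.exp(-energy(x)) for x in proposals]
    rng = random.Random(seed)
    return rng.choices(proposals, weights=weights, k=k)

# Toy energy: prefer shorter sequences.
props = ["a b c d", "a b", "a"]
picked = energy_resample(props, lambda s: len(s.split()), k=1)
```

In EDLM the energy is the learned residual correction on top of the diffusion model's approximate per-step distribution, and the proposals for one step are scored in parallel.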
♻ ☆ Detection and Analysis of Offensive Online Content in Hausa Language
Hausa, a major Chadic language spoken by over 100 million people mostly in
West Africa, is considered a low-resource language from a computational
linguistic perspective. This classification indicates a scarcity of linguistic
resources and tools necessary for handling various natural language processing
(NLP) tasks, including the detection of offensive content. To address this gap,
we conducted two sets of studies: (1) a user study (n=101) to explore
cyberbullying in Hausa and (2) an empirical study that led to the creation of
the first dataset of offensive terms in the Hausa language. We developed
detection systems trained on this dataset and compared their performance
against relevant multilingual models, including Google Translate. Our detection
system successfully identified over 70% of offensive terms, whereas baseline models
frequently mistranslated such terms. We attribute this discrepancy to the
nuanced nature of the Hausa language and the reliance of baseline models on
direct or literal translation due to limited data to build purposive detection
systems. These findings highlight the importance of incorporating cultural
context and linguistic nuances when developing NLP models for low-resource
languages such as Hausa. A post hoc analysis further revealed that offensive
language is particularly prevalent in discussions related to religion and
politics. To foster a safer online environment, we recommend involving diverse
stakeholders with expertise in local contexts and demographics. Their insights
will be crucial in developing more accurate detection systems and targeted
moderation strategies that align with cultural sensitivities.
comment: 21 pages, 4 figures, 7 tables
